United States Data Science and Machine Learning Employees

Introduction

Data science is an increasingly popular field. Typical job titles in this field include business analyst, data analyst, data scientist, decision scientist, and data engineer. The graph below shows interest over time in popular terms relating to data science. The numbers on the graph represent search interest relative to the highest point on the chart for the given time. A value of 100 is the peak popularity of the term. A value of 50 means that the term is half as popular. Similarly, a score of 0 means the term was less than 1% as popular as the peak. The graph’s data is worldwide Google search from January 2014 to January 2018.

The term graph shows the increasing popularity of the terms data science and machine learning. There is also a downward trend in the usage of business intelligence. More data science and machine learning terms will be mentioned in this report.

This report begins with a general description of the dataset and then conducts exploratory data analysis. The analysis is divided by the number of variables analyzed. The goals of the report are to convey the insights from the data science dataset and to show knowledge of data science techniques and tools including R, R Markdown, statistics, and others. Readers who encounter an unfamiliar data science or machine learning term can check the appendix for a definition or a concise description.

Dataset

In 2017, Kaggle conducted an industry-wide survey to establish a comprehensive view of data science and machine learning. The resultant multiple choice dataset has 16,716 usable observations/responses of 228 variables from 171 countries and territories. The survey was live for 18 days. The dataset can show who is working with data, what’s happening at the cutting edge of machine learning, and how new data scientists can best break into the field.[1]

Not every respondent saw every question. In an attempt to ask relevant questions to each respondent, Kaggle asked work-related questions to employed data scientists and learning-related questions to students. There is a column in the schema.csv file called “Asked” that describes who saw each question. You can learn more about the different segments used in the schema.csv file and RespondentTypeREADME.txt in the data tab.

While the dataset includes multiple files and countries, this report will focus on the cleaner dataset file multipleChoiceResponses.csv and respondents from the United States paid in USD. The final “mc_responses_usa” dataset includes 1240 responses of 22 variables. Since the United States is a net immigration country, meaning more people are moving to the U.S. than leaving it, it is likely that the data scientists who responded to the survey will stay in the U.S.
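The report does not show its filtering code, so the following is only a minimal sketch of how such a subset could be taken in R. The column names `Country` and `CompensationCurrency`, the object names, and the toy rows are all assumptions for illustration.

```r
# Toy stand-in for multipleChoiceResponses.csv (hypothetical rows and column names)
toy_responses <- data.frame(
  Country = c("United States", "India", "United States", "United States"),
  CompensationCurrency = c("USD", "INR", "EUR", "USD"),
  Age = c(33, 28, 41, 25),
  stringsAsFactors = FALSE
)

# Keep respondents from the United States who are paid in USD
toy_usa <- subset(
  toy_responses,
  Country == "United States" & CompensationCurrency == "USD"
)

nrow(toy_usa)
```

`subset()` also drops rows where the condition evaluates to `NA`, which is convenient when some respondents left the country or currency blank.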

After limiting the large Kaggle dataset to a smaller one with only the variables that will be used, the dataset needed to be cleaned and manipulated. Some variables, like education, had their responses reworded to maintain consistency. Many variables were also converted to ordered factors, but compensation was converted into a numeric variable. This was done so that the report was cleaner, so that summary statistics could be calculated, or so that the appropriate charts could be formed.
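The two conversions described above might look roughly like this in R. The raw values below are hypothetical, and the assumption that compensation arrived as text with thousands separators is made only for illustration.

```r
# Hypothetical raw survey responses
raw <- data.frame(
  TitleFit = c("Fine", "Poorly", "Perfectly", "Fine"),
  CompensationAmount = c("250,000", "20000", "100,000", NA),
  stringsAsFactors = FALSE
)

# Ordered factor: levels are listed from lowest to highest
raw$TitleFit <- factor(raw$TitleFit,
                       levels = c("Poorly", "Fine", "Perfectly"),
                       ordered = TRUE)

# Numeric compensation: strip thousands separators before converting
raw$CompensationAmount <- as.numeric(gsub(",", "", raw$CompensationAmount))

str(raw)
```

Ordered factors let comparisons such as `raw$TitleFit < "Perfectly"` work and keep categories in a sensible order on chart axes.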

The structure of the new dataset is shown below:

str(mc_responses_usa)
## 'data.frame':    1240 obs. of  22 variables:
##  $ Gender                     : Factor w/ 5 levels "","A different identity",..: 4 4 4 4 4 3 3 3 4 4 ...
##  $ Age                        : int  56 25 33 35 40 31 39 30 50 59 ...
##  $ CurrentJobTitle            : Factor w/ 17 levels "","Business Analyst",..: 10 14 15 8 15 15 12 2 12 6 ...
##  $ TitleFit                   : Ord.factor w/ 3 levels "Poorly"<"Fine"<..: 1 2 3 2 3 3 2 2 3 2 ...
##  $ MLToolNextYear             : Factor w/ 52 levels "","Amazon Machine Learning",..: 49 2 36 44 49 3 36 19 34 24 ...
##  $ MLMethodNextYear           : Factor w/ 26 levels "","Anomaly Detection",..: 15 7 7 7 7 8 4 15 24 19 ...
##  $ RecommendedLanguage        : Factor w/ 14 levels "","C/C++/C#",..: 9 9 7 9 9 10 10 9 10 9 ...
##  $ FormalEducation            : Ord.factor w/ 7 levels "High School"<..: 4 3 5 5 5 5 4 3 5 4 ...
##  $ Major                      : Factor w/ 16 levels "","A health science",..: 13 15 7 15 15 8 8 1 13 15 ...
##  $ Tenure                     : Ord.factor w/ 6 levels "I don't write code to analyze data"<..: 6 4 2 4 6 5 3 3 6 3 ...
##  $ ParentsEducation           : Ord.factor w/ 7 levels "High School"<..: 1 4 3 5 NA 4 3 3 1 3 ...
##  $ WorkDatasetSize            : Ord.factor w/ 15 levels "<1MB"<"1MB"<"10MB"<..: 5 5 7 3 7 5 4 NA 4 5 ...
##  $ TimeGatheringData          : int  50 0 0 30 60 30 80 60 20 60 ...
##  $ TimeModelBuilding          : num  20 80 0 20 20 30 10 10 25 10 ...
##  $ TimeProduction             : num  0 0 0 5 0 10 5 10 10 10 ...
##  $ TimeVisualizing            : num  10 20 0 15 20 10 5 10 25 10 ...
##  $ TimeFindingInsights        : num  20 0 0 30 0 20 0 10 20 10 ...
##  $ TimeOther                  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AlgorithmUnderstandingLevel: Ord.factor w/ 6 levels "Enough to run the code / standard library"<..: 4 4 2 3 5 3 3 3 4 3 ...
##  $ CompensationAmount         : num  250000 20000 100000 133000 215000 83500 115000 80000 135000 75000 ...
##  $ SalaryChange               : Ord.factor w/ 9 levels "Other"<"I do not want to share information about my salary/compensation"<..: 9 7 9 8 9 9 3 8 8 8 ...
##  $ JobSatisfaction            : Ord.factor w/ 11 levels "I prefer not to share"<..: 11 7 8 9 9 8 11 11 9 9 ...

Univariate Plots

In this section, preliminary exploration of the dataset occurs, along with summaries of the data and univariate plots to understand the structure of the individual variables.

The first question that drove exploration of this data is, “What age is a typical data science worker?” The box plot and summary statistics show that the median age is 33 and the mean is 36. Because the mean exceeds the median, the distribution has a slight upward (right) skew. In the box plot, the red dot marks the mean age and the thick black line marks the median age. The box spans the middle 50% of ages: its bottom edge marks the first quartile (28 years old) and its top edge marks the third quartile (42 years old). The circles mark outliers. For example, there are outliers at 1 and 72, which are the minimum and maximum. The age of 1 is likely a fake response because it is improbable that a toddler works in data science.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   28.00   33.00   36.34   42.00   72.00      10
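To see why 1 and 72 are drawn as outliers, the usual box plot rule can be applied to the quartiles in the summary. This sketch assumes the plot used the common 1.5 × IQR whisker rule (R's default); R's box plots compute hinges that can differ slightly from the quartiles printed by `summary()`.

```r
q1 <- 28   # first quartile from the summary output
q3 <- 42   # third quartile from the summary output
iqr <- q3 - q1

# Points beyond these fences are drawn as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
c(lower = lower_fence, upper = upper_fence)

# Check the reported minimum and maximum ages against the fences
extremes <- c(min = 1, max = 72)
extremes < lower_fence | extremes > upper_fence
```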

If high school graduation in the US is common at age 18, a Bachelor’s degree at 21, and a Master’s at 23, and the first quartile of age is 28, do most of these data science workers have a Ph.D.?

##             High School Some college/university              Bachelor's 
##                       4                      34                     301 
##                Master's               Doctorate     Professional Degree 
##                     551                     328                      19 
##                   Other 
##                       3

The above graph shows that the education of data science and machine learning workers has a roughly symmetric, bell-shaped distribution centered at a Master’s degree. The plurality of workers have a Master’s degree (44%), followed by a Doctorate (26%) and then a Bachelor’s (24%). Master’s degrees are 1.68 times as common as Doctorates, 1.83 times as common as Bachelor’s degrees, and 29 times as common as Professional degrees.
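The percentages and ratios quoted above can be reproduced from the education counts. The vector below is typed in from the table output; the object names are illustrative only.

```r
# Counts copied from the education table output
education <- c("High School" = 4, "Some college/university" = 34,
               "Bachelor's" = 301, "Master's" = 551, "Doctorate" = 328,
               "Professional Degree" = 19, "Other" = 3)

# Share of all respondents, in whole percent
shares <- round(100 * education / sum(education))
shares[c("Master's", "Doctorate", "Bachelor's")]

# How many times as common a Master's is compared with other degrees
ratios <- round(
  education[["Master's"]] /
    education[c("Doctorate", "Bachelor's", "Professional Degree")],
  2
)
ratios
```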

This Bachelor’s : Master’s : Doctorate ratio is unusual given that data science is not a licensed field where people are required to obtain a certain degree, as medical doctors are. It is also interesting because, for ages 25 and over, educational attainment in the United States is 8% higher for Bachelor’s, 32% lower for Master’s, and 24% lower for Doctorate when compared to the percentages for data science workers. Thus, the shape and distribution of the above graph do not match those of educational attainment in the United States.

Are data science and machine learning employees’ parents similar?

##             High School Some college/university              Bachelor's 
##                     156                     127                     324 
##                Master's               Doctorate     Professional Degree 
##                     321                     163                      82 
##                   Other                    NA's 
##                      33                      34

The parents are less educated than their children, and the graph does not exhibit the symmetry that the data scientist education graph did. Many parents want their children to be better off than they were, so these parents are probably happy that their children are more educated than they are.

What was the data scientists’ major?

##                                                              
##                                                          104 
##                                             A health science 
##                                                            9 
##                                      A humanities discipline 
##                                                           44 
##                                             A social science 
##                                                          106 
##                                                      Biology 
##                                                           50 
##                                             Computer Science 
##                                                          177 
##                                       Electrical Engineering 
##                                                          100 
##                           Engineering (non-computer focused) 
##                                                          135 
##                                 Fine arts or performing arts 
##                                                           11 
##                                     I never declared a major 
##                                                            0 
## Information technology, networking, or system administration 
##                                                           25 
##                               Management information systems 
##                                                           17 
##                                    Mathematics or statistics 
##                                                          244 
##                                                        Other 
##                                                           81 
##                                                      Physics 
##                                                          111 
##                                                   Psychology 
##                                                           26

The major graph shows that the plurality of data scientists and machine learning workers studied mathematics or statistics, followed by computer science and then engineering. Excluding the 104 blank responses, 76% had a STEM major (Health Science, Biology, Computer Science, Engineering, IT, Management Information Systems, Math, and Physics) and 24% did not (Humanities, Social Science, Fine Arts, Other, Psychology).
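The STEM share can be checked directly from the major counts. In this sketch the counts are copied from the table output with the blank level left out, and the grouping of majors into STEM follows the list in the paragraph above.

```r
# Counts copied from the major table output (blank responses excluded)
majors <- c(
  "A health science" = 9, "A humanities discipline" = 44,
  "A social science" = 106, "Biology" = 50, "Computer Science" = 177,
  "Electrical Engineering" = 100, "Engineering (non-computer focused)" = 135,
  "Fine arts or performing arts" = 11, "I never declared a major" = 0,
  "Information technology, networking, or system administration" = 25,
  "Management information systems" = 17, "Mathematics or statistics" = 244,
  "Other" = 81, "Physics" = 111, "Psychology" = 26
)

# Majors counted as STEM in this report
stem <- c("A health science", "Biology", "Computer Science",
          "Electrical Engineering", "Engineering (non-computer focused)",
          "Information technology, networking, or system administration",
          "Management information systems", "Mathematics or statistics",
          "Physics")

stem_share <- sum(majors[stem]) / sum(majors)
round(100 * stem_share)
```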

Another important characteristic for many people in choosing a profession is the compensation.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   76000  110000  120494  150000 2500000      27

Taking the square root of compensation yields an approximately normal distribution.
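The effect of a square-root transform on a right-skewed variable can be illustrated with simulated data. The values below are drawn from a gamma distribution purely as an assumption for demonstration; they are not the survey's compensation values, and `skewness()` is a small helper defined here rather than a base R function.

```r
set.seed(42)

# Simulated right-skewed "compensation" values (not survey data)
comp <- rgamma(1000, shape = 4, scale = 30000)

# Moment-based sample skewness; 0 indicates a symmetric distribution
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skewness(comp)        # positive: long right tail
skewness(sqrt(comp))  # closer to 0 after the transform
```

A histogram of `sqrt(comp)` next to one of `comp` shows the same effect visually.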

Have data scientists’ salaries been increasing, decreasing, or stagnant?

##                                                                   Other 
##                                                                      21 
##         I do not want to share information about my salary/compensation 
##                                                                      34 
##                                          I was not employed 3 years ago 
##                                                                     109 
##                                             I am not currently employed 
##                                                                       2 
##                                               Has decreased 20% or more 
##                                                                      17 
##                                        Has decreased between 6% and 19% 
##                                                                      15 
## Has stayed about the same (has not increased or decreased more than 5%) 
##                                                                     266 
##                                        Has increased between 6% and 19% 
##                                                                     328 
##                                               Has increased 20% or more 
##                                                                     443 
##                                                                    NA's 
##                                                                       5

The figure shows that salaries have been increasing for most data scientists and machine learners, and the plurality have seen theirs increase by 20% or more.

These workers make a lot more than the median and mean United States income. A logical next question is what their job titles are.

##                                                          Business Analyst 
##                                    0                                   54 
##                   Computer Scientist                         Data Analyst 
##                                   22                                  123 
##                           Data Miner                       Data Scientist 
##                                    7                                  419 
##                DBA/Database Engineer                             Engineer 
##                                   20                                   52 
##            Machine Learning Engineer     Operations Research Practitioner 
##                                   57                                   13 
##                                Other                   Predictive Modeler 
##                                  112                                   26 
##                           Programmer                           Researcher 
##                                   12                                   56 
##                 Scientist/Researcher Software Developer/Software Engineer 
##                                  122                                  108 
##                         Statistician 
##                                   37

The plurality said their title was “Data Scientist,” as expected since that is what the survey focused on. “Data Analyst” took the next highest spot, beating “Scientist/Researcher” by one response (123 to 122). The black horizontal line shows the median number of votes for a current job title, which is 52.
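The median marked by the horizontal line can be recomputed from the title counts. This sketch assumes the blank title level (0 responses) is included, which is what reproduces the reported value of 52.

```r
# Counts copied from the job title table output, including the blank level
title_counts <- c(
  "(blank)" = 0, "Business Analyst" = 54, "Computer Scientist" = 22,
  "Data Analyst" = 123, "Data Miner" = 7, "Data Scientist" = 419,
  "DBA/Database Engineer" = 20, "Engineer" = 52,
  "Machine Learning Engineer" = 57, "Operations Research Practitioner" = 13,
  "Other" = 112, "Predictive Modeler" = 26, "Programmer" = 12,
  "Researcher" = 56, "Scientist/Researcher" = 122,
  "Software Developer/Software Engineer" = 108, "Statistician" = 37
)

median(title_counts)

# Titles with the most responses
head(sort(title_counts, decreasing = TRUE), 3)
```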

Does that title fit them well?

##    Poorly      Fine Perfectly      NA's 
##       159       806       254        21

Most think that the title fits them well: 65% answered “Fine” and 20% answered “Perfectly.”

Are they happy at their job?

## I prefer not to share                     1                     2 
##                    11                    33                    14 
##                     3                     4                     5 
##                    66                    57                   120 
##                     6                     7                     8 
##                   116                   259                   271 
##                     9                    10                  NA's 
##                   170                   122                     1

The column graph shows that most of them are happy. The plurality chose 8. All of the higher ratings meet or exceed the median number of votes per satisfaction level, which is marked by the gold line.

Finally, what do they spend their time at work doing?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   20.00   40.00   38.36   50.00  100.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   10.00   20.00   19.03   25.00   90.00       1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    5.00   10.29   15.00  100.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   10.00   13.92   20.00  100.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00   10.00   14.84   20.00  100.00       2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   2.579   0.000 100.000       5

Data scientists spend more of their time at work gathering data than on any other task, so how much data do they gather?

##  <1MB   1MB  10MB 100MB   1GB  10GB 100GB   1TB  10TB 100TB   1PB  10PB 
##    29    34    90   173   271   219   143    88    39    11     4     3 
## 100PB   1EB  >1EB  NA's 
##     1     0     1   134

How long have they been writing code?

## I don't write code to analyze data                   Less than a year 
##                                  1                                 40 
##                       1 to 2 years                       3 to 5 years 
##                                186                                379 
##                      6 to 10 years                 More than 10 years 
##                                260                                370 
##                               NA's 
##                                  4

Do they understand their algorithms well?

##                        Enough to run the code / standard library 
##                                                              123 
##                           Enough to tune the parameters properly 
##                                                               97 
##         Enough to explain the algorithm to someone non-technical 
##                                                              504 
##                   Enough to refine and innovate on the algorithm 
##                                                              152 
##   Enough to code it again from scratch, albeit it may run slowly 
##                                                              306 
## Enough to code it from scratch and have it be fast and efficient 
##                                                               51 
##                                                             NA's 
##                                                                7

What algorithms do they plan on learning or using in 2018?

##                                             
##                                          45 
##                           Anomaly Detection 
##                                          50 
##                           Association Rules 
##                                           2 
##                            Bayesian Methods 
##                                          85 
##                            Cluster Analysis 
##                                          28 
##                              Decision Trees 
##                                           8 
##                               Deep learning 
##                                         411 
##   Ensemble Methods (e.g. boosting, bagging) 
##                                          33 
##                             Factor Analysis 
##                                           4 
##           Genetic & Evolutionary Algorithms 
##                                          56 
## I don't plan on learning a new ML/DS method 
##                                          29 
##                               Link Analysis 
##                                           5 
##                                        MARS 
##                                           4 
##                         Monte Carlo Methods 
##                                          23 
##                                 Neural Nets 
##                                         175 
##                                       Other 
##                                          39 
##                      Proprietary Algorithms 
##                                          13 
##                              Random Forests 
##                                          14 
##                                  Regression 
##                                          12 
##                              Rule Induction 
##                                           6 
##                     Social Network Analysis 
##                                          39 
##               Support Vector Machines (SVM) 
##                                          23 
##                           Survival Analysis 
##                                          16 
##                                 Text Mining 
##                                          39 
##                        Time Series Analysis 
##                                          75 
##                             Uplift Modeling 
##                                           6

Note: If you do not know any of the above terms, check the appendix.

What machine learning tool do they plan to use in 2018?

##                                                    
##                                                 44 
##                            Amazon Machine Learning 
##                                                 56 
##                                Amazon Web services 
##                                                 39 
##                                             Angoss 
##                                                  0 
##                                              C/C++ 
##                                                 13 
##                                           Cloudera 
##                                                  2 
##                                          DataRobot 
##                                                 26 
##                                              Flume 
##                                                  1 
##                               Google Cloud Compute 
##                                                 33 
##                                    Hadoop/Hive/Pig 
##                                                 55 
##     I don't plan on learning a new tool/technology 
##                                                 44 
##                                         IBM Cognos 
##                                                  1 
##                                   IBM SPSS Modeler 
##                                                  4 
##                                IBM SPSS Statistics 
##                                                  2 
##                       IBM Watson / Waton Analytics 
##                                                 14 
##                                             Impala 
##                                                  5 
##                                               Java 
##                                                  9 
##                                              Julia 
##                                                 32 
##                                  Jupyter notebooks 
##                                                 37 
##                         KNIME (commercial version) 
##                                                  0 
##                               KNIME (free version) 
##                                                  1 
##                                        Mathematica 
##                                                  2 
##                                      MATLAB/Octave 
##                                                  1 
##                   Microsoft Azure Machine Learning 
##                                                 18 
##                        Microsoft Excel Data Mining 
##                                                  2 
## Microsoft R Server (Formerly Revolution Analytics) 
##                                                  8 
##                   Microsoft SQL Server Data Mining 
##                                                  3 
##                                            Minitab 
##                                                  0 
##                                              NoSQL 
##                                                 10 
##            Oracle Data Mining/ Oracle R Enterprise 
##                                                  1 
##                                             Orange 
##                                                  0 
##                                              Other 
##                                                 76 
##                                               Perl 
##                                                  1 
##                                             Python 
##                                                152 
##                                           QlikView 
##                                                  0 
##                                                  R 
##                                                 78 
##                    RapidMiner (commercial version) 
##                                                  2 
##                          RapidMiner (free version) 
##                                                  4 
##           Salfrod Systems CART/MARS/TreeNet/RF/SPM 
##                                                  0 
##           SAP BusinessObjects Predictive Analytics 
##                                                  1 
##                                           SAS Base 
##                                                  5 
##                               SAS Enterprise Miner 
##                                                  4 
##                                            SAS JMP 
##                                                  2 
##                                      Spark / MLlib 
##                                                109 
##                                                SQL 
##                                                 16 
##                                               Stan 
##                                                 14 
##          Statistica (Quest/Dell-formerly Statsoft) 
##                                                  0 
##                                            Tableau 
##                                                 24 
##                                         TensorFlow 
##                                                286 
##                                     TIBCO Spotfire 
##                                                  0 
##                                   Unix shell / awk 
##                                                  3 
##                                               Weka 
##                                                  0

Further Univariate Analysis

There are many features of interest in this smaller dataset and even more in the broader dataset. This report focuses on gender, age, formal education, recommended programming language, compensation, job satisfaction, and tools and methods to learn. The general statistics or counts of each of those variables were plotted in the univariate section. In the process of plotting that data, temporary new variables were created from existing variables in the dataset.

Four interesting insights from the univariate plots are:
1. Computer Science majors were not the dominant major among data scientists and machine learners.
2. The square root of compensation resulted in an approximately normal distribution.
3. The plurality of data scientists and machine learners (36%) say their salaries increased by at least 20% in the last three years.
4. There is a decrease at 6 to 10 years of coding tenure. Is that decrease still prevalent when coding tenure is divided by gender?

The next segment will investigate whether the statistics or distributions differ when variables are compared or subdivided by other variables, such as whether the median age or compensation differs between males and females.

Bivariate Plots

Based on the univariate plots, there are many interesting possible relationships between variables. First, is there a relationship between compensation and age for data scientists?

## 
##  Pearson's product-moment correlation
## 
## data:  mc_responses_usa$Age and mc_responses_usa$CompensationAmount
## t = 7.3631, df = 1201, p-value = 0.0000000000003327
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1531058 0.2612764
## sample estimates:
##       cor 
## 0.2078264

Note: geom_jitter was used because age is a continuous variable, but respondents reported only integer ages.

Since the Pearson correlation (0.2078) is below 0.3, the linear relationship between age and compensation is weak. The relationship is nonetheless statistically significant, since its p-value is below 0.05.
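The test statistic and confidence interval in the output above follow from the correlation and the degrees of freedom alone. The sketch below recomputes them using the standard t statistic for a Pearson correlation and the Fisher z-transform for the interval; the variable names are illustrative.

```r
r  <- 0.2078264   # sample correlation from the output
df <- 1201        # degrees of freedom from the output
n  <- df + 2      # number of complete age/compensation pairs

# t statistic for H0: true correlation equals 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z-transform
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 4)
round(ci, 4)
```

The interval excludes 0, which is why the correlation is significant, yet it also stays below 0.3, which is why the relationship is described as weak.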

##   Group.1        Age CompensationAmount
## 1       1  0.4471476         12.2090292
## 2       2  1.2254464          0.2941814
## 3       3 -0.5685766         -0.1955254

This is likely because compensation typically increases over time and age always increases over time, so the two variables typically move in the same direction. However, being a certain age does not guarantee a certain income.

## mc_responses_usa$Gender: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   90000   90000   90000   90000   90000   90000 
## -------------------------------------------------------- 
## mc_responses_usa$Gender: A different identity
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   20000  115000  135000  115600  150000  158000       1 
## -------------------------------------------------------- 
## mc_responses_usa$Gender: Female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     185   65000   90000  100065  130000  330000       8 
## -------------------------------------------------------- 
## mc_responses_usa$Gender: Male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   80000  110000  125427  150000 2500000      18 
## -------------------------------------------------------- 
## mc_responses_usa$Gender: Non-binary, genderqueer, or gender non-conforming
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5000   39000   97300  110825  185000  235000

The gender-compensation plot shows that males have a higher median and mean salary than females. They also have a higher maximum, third quartile, and first quartile. Since only six and eight respondents identified as “a different identity” or “non-binary, genderqueer, or gender non-conforming,” statistically meaningful conclusions cannot be drawn about their compensation.
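Grouped summaries in the format shown above are what base R's `by()` prints. The following sketch uses a small hypothetical data frame, not the survey responses, to show the pattern.

```r
# Hypothetical rows for illustration only
toy <- data.frame(
  Gender = c("Female", "Female", "Female", "Male", "Male", "Male"),
  CompensationAmount = c(65000, 90000, 130000, 80000, 110000, 150000)
)

# One summary() per gender, printed with dashed separators
by_gender <- by(toy$CompensationAmount, toy$Gender, summary)
print(by_gender)

# A compact alternative when only one statistic is needed
medians <- tapply(toy$CompensationAmount, toy$Gender, median, na.rm = TRUE)
medians
```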

## mc_responses_usa$RecommendedLanguage: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1   55000   83000   95931  135000  200000       1 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: C/C++/C#
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    5000   66000  104500  120125  160250  350000       1 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: F#
## NULL
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Haskell
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   40000   62500   85000   85000  107500  130000 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Java
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   78000   85000   86000  117000  132500  220000 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Julia
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  210000  407500  605000  605000  802500 1000000 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Matlab
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11500   94000  106000  117158  140000  240000 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Other
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   70000   82125   94000  117688  156250  200000 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Python
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   80000  110000  125335  150000 2500000      20 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: R
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   70000   97500  110730  135000  550000       4 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: SAS
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   65000   70000  100000  120385  140000  300000       1 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Scala
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  108000  122500  140000  140429  157500  175000 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: SQL
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   71500  108000  104527  130000  220000 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Stata
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22000   73000   95000   81500  103500  114000

The above plot shows that Python is the most recommended programming language, followed by R and then SQL. Respondents who recommend Python report higher salaries than those who recommend R.

It would be insightful to know whether compensation differs significantly between the language a respondent actually uses and the language they recommend. This Kaggle study did not track that. It is also possible that people would recommend both Python and R equally, which Kaggle also did not track. Lastly, the survey asked what respondents “recommend,” so a data scientist may have recommended something they do not use.

Since “blank” or “no language” had 30 respondents, 30 was used as the minimum number of responses required to provide useful data. This left only Python (783 respondents), R (283), and SQL (64) with enough respondents to draw insights.

Among these three languages, the median compensation per recommended language is highest for Python (110,000), then SQL (108,000), then R (97,500). By mean compensation, the order is Python (125,335), then R (110,730), then SQL (104,527).
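The minimum-respondent rule described above can be sketched as follows. The data are made up, and the threshold `min_n` is set to 4 only so that the toy example has something to drop; the report itself used 30.

```r
# Toy data (made up): recommended language and compensation per respondent.
toy <- data.frame(
  RecommendedLanguage = c(rep("Python", 5), rep("R", 4), rep("Scala", 2)),
  CompensationAmount  = c(80000, 110000, 150000, 95000, 120000,
                          70000, 97500, 135000, 100000,
                          108000, 175000)
)

min_n <- 4  # hypothetical threshold for the toy data; the report used 30

# Count respondents per language and keep only languages that meet the threshold.
counts <- table(toy$RecommendedLanguage)
keep   <- names(counts)[counts >= min_n]
kept   <- toy[toy$RecommendedLanguage %in% keep, ]

# Median and mean compensation for the retained languages.
lang_stats <- data.frame(
  n      = as.vector(counts[keep]),
  median = as.vector(tapply(kept$CompensationAmount, kept$RecommendedLanguage, median)[keep]),
  mean   = as.vector(tapply(kept$CompensationAmount, kept$RecommendedLanguage, mean)[keep]),
  row.names = keep
)
print(lang_stats)
```

Sorting by median and by mean can give different orders, as happens with R and SQL in the report, because the mean is more sensitive to a few very large salaries.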

What’s the breakdown of education by job satisfaction?

Is this breakdown similar to the parents’ education levels?

## mc_responses_usa$ParentsEducation: High School
##             High School Some college/university              Bachelor's 
##                       1                       9                      37 
##                Master's               Doctorate     Professional Degree 
##                      66                      41                       2 
##                   Other 
##                       0 
## -------------------------------------------------------- 
## mc_responses_usa$ParentsEducation: Some college/university
##             High School Some college/university              Bachelor's 
##                       0                       9                      30 
##                Master's               Doctorate     Professional Degree 
##                      58                      26                       4 
##                   Other 
##                       0 
## -------------------------------------------------------- 
## mc_responses_usa$ParentsEducation: Bachelor's
##             High School Some college/university              Bachelor's 
##                       1                       5                      75 
##                Master's               Doctorate     Professional Degree 
##                     155                      84                       2 
##                   Other 
##                       2 
## -------------------------------------------------------- 
## mc_responses_usa$ParentsEducation: Master's
##             High School Some college/university              Bachelor's 
##                       1                       3                      82 
##                Master's               Doctorate     Professional Degree 
##                     152                      80                       3 
##                   Other 
##                       0 
## -------------------------------------------------------- 
## mc_responses_usa$ParentsEducation: Doctorate
##             High School Some college/university              Bachelor's 
##                       0                       1                      41 
##                Master's               Doctorate     Professional Degree 
##                      65                      55                       1 
##                   Other 
##                       0 
## -------------------------------------------------------- 
## mc_responses_usa$ParentsEducation: Professional Degree
##             High School Some college/university              Bachelor's 
##                       0                       2                      20 
##                Master's               Doctorate     Professional Degree 
##                      30                      25                       5 
##                   Other 
##                       0 
## -------------------------------------------------------- 
## mc_responses_usa$ParentsEducation: Other
##             High School Some college/university              Bachelor's 
##                       1                       3                       9 
##                Master's               Doctorate     Professional Degree 
##                      10                       7                       2 
##                   Other 
##                       1

For the correlogram above, the Spearman method was used because the relationships in the data were assumed to be non-linear.

The null hypothesis is therefore that the Spearman correlation coefficient, rho, is 0. A rho of 0 means that the ranks of one variable do not covary with the ranks of the other variable. In other words, as the ranks of one variable increase, the ranks of the other variable neither consistently increase nor consistently decrease. In the graph, a small p-value (typically less than or equal to 0.05) indicates strong evidence against the null hypothesis, so the null hypothesis is rejected.
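As a minimal sketch of the test described above, the example below runs a Spearman correlation test on made-up values with base R's cor.test(). The vectors `age` and `tenure` are hypothetical; an ordered factor from the dataset would first need to be converted to its integer codes, for example with as.integer().

```r
# Made-up data: age in years and an integer coding of ordered tenure levels.
age    <- c(25, 31, 33, 35, 39, 40, 50, 56, 59, 62)
tenure <- c(1, 2, 2, 3, 3, 4, 5, 5, 6, 6)

# Spearman test of H0: rho = 0. exact = FALSE avoids warnings caused by ties.
res <- cor.test(age, tenure, method = "spearman", exact = FALSE)
rho <- unname(res$estimate)
p   <- res$p.value

# Spearman's rho is Pearson's correlation computed on the ranks.
rho_by_hand <- cor(rank(age), rank(tenure))

cat("rho =", round(rho, 3), " p =", signif(p, 3), "\n")
```

Because the statistic depends only on ranks, any monotone recoding of an ordered factor gives the same rho.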

Excluding the forced correlations with the time spent doing different tasks at work, some logical insights from the correlation matrix are:

  1. Formal education and age are positively correlated as expected (rho = 0.272, p < .001).

  2. Coding tenure and age have the strongest correlation at 0.57, followed by tenure and compensation amount at 0.42, then compensation amount and age at 0.391, and then coding tenure and education at 0.301.

  3. A correlation between work dataset size and the machine learning tool to learn next year is not unexpected (rho = 0.069, p = .028), as people’s decisions about which tool to learn could be affected by how much data they have to process. Similar logic can be applied to recommended language and work dataset size (rho = -0.095, p = 0.002).

  4. Time spent model building is correlated with coding tenure, major, formal education, and recommended language next year, as might be expected. (It also showed a correlation with title fit, which was not expected.)

  5. Recommended language is correlated with all the work time variables except “time other” and “time gathering data.”

  6. Job satisfaction is correlated with job title fit, salary change, and compensation amount.

  7. Salary change is correlated with coding tenure and compensation amount.

  8. Lastly, there is also a correlation between compensation and: age, gender, education (but not parents’ education), coding tenure, and algorithm understanding level.

There were more variable correlations, but many began to stretch what most people would consider justified. For instance, tenure and work dataset size were correlated. An argument could be made that the better coders have coded longer, i.e., have a longer coding tenure. Since they are better coders, they are assigned or choose to work with larger datasets at work. However, others might say that work dataset size is independent of a coder’s tenure. The dataset size depends more on how many people use the service that is collecting the data, how many people respond to a specific survey, or which dataset the coder’s employer bought.

Further Bivariate Analysis

The box plot, before adjusting the y-axis, showed outliers near 100 years old for “a different identity,” “male,” and “non-binary, genderqueer, or gender non-conforming.” There was no outlier in this high age range for females or for individuals who did not specify a gender. This highlights two of the most difficult data collection issues: the completeness of answers and the integrity of answers. Because ages near 100 are outliers, it is doubtful that these individuals are really 100 years old; more likely they entered an incorrect age.

The correlation matrix displays the correlations between every pair of variables. Excluding time-dependent variables (age, coding tenure, work time tasks), the strongest relationship was between salary change and compensation amount at .231, then job satisfaction and title fit at .221, and finally the unexpected correlation between compensation amount and work dataset size at .176.

The statistically significant correlation between compensation amount and gender is consistent with the gender pay inequality that is sometimes reported in the news. Fortunately, it is only a weak correlation of 0.129.

Multivariate Plots

The densest region, around a data scientist’s education of Master’s and a job title fit of “fine,” shows that the largest share of respondents had a Master’s degree, as seen in an earlier graph. The lack of a clear grouping of color at any education level suggests that there is no correlation between parents’ education and job satisfaction, or between parents’ education and the data scientist’s own education. The few points at low job satisfaction values mean most respondents were satisfied with their job. Lastly, the grouping near 10 and “perfectly” for Doctorate and the grouping near 1 and “poorly” for Master’s suggest that job title fit and job satisfaction could be correlated, which according to the correlation matrix they are.

Further Multivariate Analysis

A surprising insight is the lack of influence of parents’ education on the other variables. It was only weakly correlated with age and title fit. Parents’ education was expected to be correlated with the offspring’s (the data scientist’s) education as well.

A linear regression model and a K-means cluster model were also created; they appear in the “Final Plots and Summary” section.

Final Plots and Summary

The insights drawn from the numerous plots and statistics of this public dataset can aid employees, employers, and others by helping people determine what to learn, what to include in job requirements, what to title job positions, and at what salary level to begin compensation packages. To continue emphasizing key information about the data science and machine learning industry, two multivariate plots and one cluster analysis are highlighted in the Summary.

Lastly, some tips drawn from the data insights are:

  1. Learn Python, R, and SQL, as they are the languages most used by data scientists.

  2. Learn deep learning and neural nets as they will be the most sought-after techniques in the future.

  3. Develop skills for gathering data as it can be the most time-consuming process in the workflow of a data scientist.

  4. Statistics and mathematics are vital to understanding how certain algorithms work.

Immutable Attributes

Age, gender, and compensation cannot easily be changed by an employee. Compensation is not considered easy to change because it is determined by the employer.

The above graph shows the dominance of male (blue) over female (light red) responses in the survey. It also makes some interesting outliers evident, such as the female respondent who is around 70 years old but one of the lowest paid, or the 25-year-old male who is one of the highest paid.

K-Means Clustering

To analyze the point variability of two of these immutable attributes, a K-Means cluster analysis was conducted.

A k of 3 was chosen because solutions with 2 and 4 clusters explained a smaller percentage of the point variability.
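A comparison of candidate values of k can be sketched with base R's kmeans(), which reports the explained share of point variability as between_SS / total_SS. The example below is only an illustration on made-up data: the choice of age and compensation as the two attributes, the scaling step, and `nstart = 25` are assumptions, not necessarily what the report did.

```r
set.seed(42)

# Made-up age and compensation values in three loose groups.
toy <- data.frame(
  Age = c(rnorm(30, 28, 3), rnorm(30, 40, 4), rnorm(30, 55, 4)),
  CompensationAmount = c(rnorm(30, 80000, 10000),
                         rnorm(30, 120000, 15000),
                         rnorm(30, 160000, 20000))
)

# Scale both columns so compensation does not dominate the distances (an assumption).
scaled <- scale(toy)

# Share of total variability explained by the clustering for k = 2, 3, 4.
explained <- sapply(2:4, function(k) {
  fit <- kmeans(scaled, centers = k, nstart = 25)
  fit$betweenss / fit$totss
})
names(explained) <- paste0("k=", 2:4)
print(round(100 * explained, 1))
```

Because k-means starts from random centers, results can differ between runs; fixing the seed and using several starts makes the comparison repeatable.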

Mutable Attributes

The level at which an employee understands something and the programming tools or languages that the employee recommends can easily be changed by the employee.

## 
## Call:
## lm(formula = MLMnum ~ AlULnum)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.278  -4.278  -4.097   3.813  14.903 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 11.05200    0.55126  20.049 <0.0000000000000002 ***
## AlULnum      0.04514    0.14861   0.304               0.761    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.902 on 1231 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  7.494e-05,  Adjusted R-squared:  -0.0007373 
## F-statistic: 0.09226 on 1 and 1231 DF,  p-value: 0.7614

The chart above shows that among respondents who recommended Python, the most popular recommended language among data scientists and machine learners, those who also recommended “Deep Learning” include the largest number of people who know enough to code algorithms from scratch and have them run fast and efficiently.
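The regression output above reports a slope of 0.045 with a p-value of 0.761 and an R-squared of about 0.00007, so the fit gives no evidence of a linear relationship between the two numerically coded variables. As a minimal sketch of how such a fit is produced and read, the example below uses made-up data in which the response is unrelated to the predictor by construction; the names `x` and `y` are placeholders, not the report's variables.

```r
set.seed(7)

# Made-up data: x is a 1..6 coding, y is noise around 11 and does not depend on x.
toy <- data.frame(x = rep(1:6, each = 20))
toy$y <- 11 + rnorm(nrow(toy), sd = 7)

fit   <- lm(y ~ x, data = toy)
coefs <- summary(fit)$coefficients

slope   <- coefs["x", "Estimate"]
slope_p <- coefs["x", "Pr(>|t|)"]   # p-value of the t-test on the slope
r2      <- summary(fit)$r.squared   # share of variance in y explained by x

# With a single predictor, the overall F-test has the same p-value as the slope's t-test.
f   <- summary(fit)$fstatistic
f_p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

cat("slope =", round(slope, 3), " p =", round(slope_p, 3), " R^2 =", signif(r2, 3), "\n")
```

This matches the pattern in the report's output, where the slope's p-value (0.761) and the F-statistic's p-value (0.7614) agree.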

Reflection

The US data science and machine learning dataset contains information on 1240 respondents across 22 variables from 2017. I started by understanding the individual variables in the dataset in the univariate section, and then I explored exciting leads as I continued to make observations on plots, apply statistical concepts, and describe relationships in the data. Eventually, I explored compensation across many variables and created a K-means cluster graph.

There were never strong trends between variables, but there were many statistically significant correlations. I was also surprised that algorithm understanding level correlated with the most variables: 12 out of 21 possible variables. The next closest was job satisfaction, which correlated with nine other variables.

In future work, the data science field could benefit from data on job search duration. Job hunt time was included in the Kaggle survey, but the question was not answered in the United States. Additionally, a breakdown of data scientists by race/ethnicity and sexual orientation could also provide interesting insight.

The respondents to Kaggle’s survey were most likely Kaggle users, which means the resultant dataset is biased towards that type of respondent. It is unknown if data scientists that use Kaggle are significantly different from data scientists that do not use Kaggle.

Appendix

Terminology

Association Rules - rule-based machine learning method for discovering interesting relations between variables in large databases.[3]

Bayesian Methods - a statistical inference method in which Bayes’ theorem is used to update the probability of a hypothesis as more data becomes available.

Convolutional Neural Network - a class of feed-forward neural networks composed of one or more convolutional layers followed by one or more fully connected layers; these networks have been used successfully for analyzing imagery.

Cluster Analysis - grouping a set of items in such a way that objects in the same cluster are more similar (in some sense) to each other than to those in other clusters.

Collaborative Filtering - a technique used by recommender systems that predicts a user’s interests from the preferences of many other users. The term has both a narrow and a more general sense.

Cross-Validation - a model validation method for assessing how the outcomes of statistical analysis will generalize to an independent dataset.

Decision Trees - a decision support tool that uses a tree-like graph or model of decisions and their possible outcomes, including chance event outcomes, resource costs, and utility.

Deep Learning - a subset of machine learning methods based on learning data representations, as opposed to task-specific algorithms.

Dimensional Modeling - set of concepts and techniques used in data warehouse design.

Ensemble Methods (e.g., boosting, bagging) - use of diverse learning algorithms to achieve better predictive performance than could be gained from any of the constituent learning algorithms alone.

Factor Analysis - a statistical technique to describe variability among observed, correlated variables concerning a possibly lower number of unobserved variables named factors. For instance, it is possible that variations in eight observed variables mainly reflect the changes in two underlying, unobserved variables.

Genetic and Evolutionary Algorithms - an evolutionary algorithm (EA) is a generic population-based metaheuristic optimization algorithm that uses mechanisms inspired by biological evolution, such as mutation, reproduction, selection, and recombination. Genetic algorithms are a frequently encountered class of evolutionary algorithm, but there are other types.

Link Analysis - a data-analysis technique used to evaluate relationships between nodes. Relationships may be identified among various types of nodes, including organizations, people, and transactions.

Multivariate Adaptive Regression Splines - a non-parametric regression technique that automatically models nonlinearities and interactions between variables.

Monte Carlo Methods - a set of computational algorithms that depend on repeated random sampling to obtain numerical results.

Neural Nets - a system of data structures and programs that approximates the operation of the human brain. A neural network ordinarily involves many processors operating in parallel.

Principal Components Analysis - a statistical operation that uses an orthogonal transformation to convert a batch of potentially correlated variables into a group of linearly uncorrelated variables termed principal components.

Random Forests - an ensemble learning method for regression, classification, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

Linear Regression - a linear approach for modeling the relationship between one or more explanatory variables and a scalar dependent variable.

Rule Induction - an area of machine learning in which formal rules are extracted from a set of observations. The rules obtained may represent a full scientific model of the data, or merely local patterns in the data.

Social Network Analysis - a process of examining social structures through the use of networks and graph theory. It characterizes networked structures as nodes (individual actors, things, or people within the system) and the ties, edges, or links (interactions or relationships) that connect them.

Support Vector Machines (SVM) - a discriminative classifier formally defined by a separating hyperplane; i.e., given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.

Survival Analysis - a division of statistics for analyzing the expected time until one or more events happen.

Text Mining - the process of deriving high-quality information from text, typically by identifying trends or patterns through methods such as statistical pattern learning.

Time Series - a sequence of data points indexed in time order. Since a time series is commonly a sequence taken at successive equally spaced intervals, it is a sequence of discrete-time data.

Uplift Modeling - a predictive modeling method that directly illustrates the incremental impact of a treatment, such as a marketing action, on an actor’s behavior.

Works Cited

  1. Kaggle. “Kaggle ML and Data Science Survey, 2017.” Kaggle, 2017. Web. 12 January 2018. https://www.kaggle.com/kaggle/kaggle-survey-2017/.

  2. The percentages, which are calculated based on Census data by counting people that had attained that level or higher add up to more than 100% because they are cumulative. For example, it is assumed that all people with doctorates also have undergraduate and high school degrees, and are thus counted twice in the “lower” categories. “Educational Attainment in the United States: 2014”. U.S. Census Bureau. Retrieved January 29, 2015.

  3. Piatetsky-Shapiro, Gregory (1991), Discovery, analysis, and presentation of strong rules, in Piatetsky-Shapiro, Gregory; and Frawley, William J.; eds., Knowledge Discovery in Databases, AAAI/MIT Press, Cambridge, MA.

United States Data Science and Machine Learning Employees

Introduction

Data science is an increasingly popular field. Typical job titles related to this field include business analyst, data analyst, data scientist, decision scientist, and data engineer. The graph below shows the interest over time in popular terms relating to data science. The numbers on the graph represent search interest relative to the highest point on the chart for the given time. A value of 100 is the peak popularity of the term. A value of 50 means that the term is half as popular. Similarly, a score of 0 means the term was less than 1% as popular as the peak. The graph’s data covers worldwide Google searches from January 2014 to January 2018.

The term graph shows the increasing popularity of the terms data science and machine learning. There is also a downward trend in the usage of business intelligence. More data science and machine learning terms will be mentioned in this report.

This report begins with a general description of the dataset and then conducts exploratory data analysis. The analysis is divided by the number of variables analyzed. The goals of the report are to convey the insights from the data science dataset and to show knowledge of data science techniques and tools including R, RMD, statistics, and others. For data science or machine learning terms a reader does not know, check the appendix to learn the definition or to read a concise description.

Dataset

In 2017, Kaggle conducted an industry-wide survey to establish a comprehensive view of data science and machine learning. The resultant multiple choice dataset has 16,716 usable observations/responses of 228 variables from 171 countries and territories. The survey was live for 18 days. The dataset can show who is working with data, what’s happening at the cutting edge of machine learning, and how new data scientists can best break into the field.[1]

Not every respondent saw every question. In an attempt to ask relevant questions to each respondent, Kaggle asked work-related questions to employed data scientists and learning-related questions to students. There is a column in the schema.csv file called “Asked” that describes who saw each question. You can learn more about the different segments used in the schema.csv file and RespondentTypeREADME.txt in the data tab.

While the dataset includes multiple files and countries, this report will focus on the cleaner dataset file multipleChoiceResponses.csv and respondents from the United States paid in USD. The final “mc_responses_usa” dataset includes 1240 responses of 22 variables. Since the United States is a net immigration country, meaning more people are moving to the U.S. than leaving it, it is likely that the data scientists who responded to the survey will stay in the US.

After limiting the large Kaggle dataset to a smaller one with only the variables that will be used, the dataset needed to be cleaned and manipulated. Some variables, like education, had their responses reworded to maintain consistency. Many variables were also converted to ordered factors, and the compensation was converted to a numeric type. This was done so that the report would be cleaner, summary statistics could be calculated, and the appropriate charts could be formed.
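The cleaning steps described above can be sketched as follows. This is only an illustration under stated assumptions: the raw labels, the `relabel` mapping, and the comma-formatted compensation strings are hypothetical stand-ins, not the survey's exact values.

```r
# Hypothetical raw responses (labels and formats are assumptions for illustration).
raw <- data.frame(
  FormalEducation    = c("Master's degree", "Bachelor's degree",
                         "Doctoral degree", "Master's degree"),
  CompensationAmount = c("133,000", "83500", "", "250,000"),
  stringsAsFactors   = FALSE
)

# Reword responses to shorter, consistent labels (hypothetical mapping).
relabel    <- c("Bachelor's degree" = "Bachelor's",
                "Master's degree"   = "Master's",
                "Doctoral degree"   = "Doctorate")
edu_levels <- c("Bachelor's", "Master's", "Doctorate")

clean <- data.frame(
  # Ordered factor: levels carry a rank, so comparisons and rank statistics work.
  FormalEducation = factor(unname(relabel[raw$FormalEducation]),
                           levels = edu_levels, ordered = TRUE),
  # Strip thousands separators, then convert; empty strings become NA.
  CompensationAmount = as.numeric(gsub(",", "", raw$CompensationAmount))
)

str(clean)
```

Declaring the level order explicitly matters because R would otherwise sort the levels alphabetically.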

The structure of the new dataset is shown below:

str(mc_responses_usa)
## 'data.frame':    1240 obs. of  22 variables:
##  $ Gender                     : Factor w/ 5 levels "","A different identity",..: 4 4 4 4 4 3 3 3 4 4 ...
##  $ Age                        : int  56 25 33 35 40 31 39 30 50 59 ...
##  $ CurrentJobTitle            : Factor w/ 17 levels "","Business Analyst",..: 10 14 15 8 15 15 12 2 12 6 ...
##  $ TitleFit                   : Ord.factor w/ 3 levels "Poorly"<"Fine"<..: 1 2 3 2 3 3 2 2 3 2 ...
##  $ MLToolNextYear             : Factor w/ 52 levels "","Amazon Machine Learning",..: 49 2 36 44 49 3 36 19 34 24 ...
##  $ MLMethodNextYear           : Factor w/ 26 levels "","Anomaly Detection",..: 15 7 7 7 7 8 4 15 24 19 ...
##  $ RecommendedLanguage        : Factor w/ 14 levels "","C/C++/C#",..: 9 9 7 9 9 10 10 9 10 9 ...
##  $ FormalEducation            : Ord.factor w/ 7 levels "High School"<..: 4 3 5 5 5 5 4 3 5 4 ...
##  $ Major                      : Factor w/ 16 levels "","A health science",..: 13 15 7 15 15 8 8 1 13 15 ...
##  $ Tenure                     : Ord.factor w/ 6 levels "I don't write code to analyze data"<..: 6 4 2 4 6 5 3 3 6 3 ...
##  $ ParentsEducation           : Ord.factor w/ 7 levels "High School"<..: 1 4 3 5 NA 4 3 3 1 3 ...
##  $ WorkDatasetSize            : Ord.factor w/ 15 levels "<1MB"<"1MB"<"10MB"<..: 5 5 7 3 7 5 4 NA 4 5 ...
##  $ TimeGatheringData          : int  50 0 0 30 60 30 80 60 20 60 ...
##  $ TimeModelBuilding          : num  20 80 0 20 20 30 10 10 25 10 ...
##  $ TimeProduction             : num  0 0 0 5 0 10 5 10 10 10 ...
##  $ TimeVisualizing            : num  10 20 0 15 20 10 5 10 25 10 ...
##  $ TimeFindingInsights        : num  20 0 0 30 0 20 0 10 20 10 ...
##  $ TimeOther                  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AlgorithmUnderstandingLevel: Ord.factor w/ 6 levels "Enough to run the code / standard library"<..: 4 4 2 3 5 3 3 3 4 3 ...
##  $ CompensationAmount         : num  250000 20000 100000 133000 215000 83500 115000 80000 135000 75000 ...
##  $ SalaryChange               : Ord.factor w/ 9 levels "Other"<"I do not want to share information about my salary/compensation"<..: 9 7 9 8 9 9 3 8 8 8 ...
##  $ JobSatisfaction            : Ord.factor w/ 11 levels "I prefer not to share"<..: 11 7 8 9 9 8 11 11 9 9 ...

Univariate Plots

In this section, preliminary exploration of the dataset occurs, along with summaries of the data and univariate plots to understand the structure of the individual variables.

The first question that drove exploration of this data is, “What age is a typical data science worker?” The box plot and summary statistics show that the median age is 33 and the mean is 36. Because the mean exceeds the median, there is a slight upward skew. In the box plot, the red dot marks the mean age and the thick black line marks the median age. The box spans the middle 50% of ages: its bottom edge marks the first quartile (28 years old), and its top edge marks the third quartile (42 years old). The circles mark outliers. For example, there are outliers at 1 and 72, which are the minimum and maximum. The age of 1 is likely fake because it is improbable that a toddler works in data science.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   28.00   33.00   36.34   42.00   72.00      10
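The outliers mentioned above can be checked from the quartiles in the summary. The sketch below assumes the common 1.5 × IQR rule, which is the default for box plot whiskers in R; the report does not state which rule its plot used.

```r
# Quartiles taken from the age summary above.
q1 <- 28
q3 <- 42

iqr         <- q3 - q1          # interquartile range: 14
lower_fence <- q1 - 1.5 * iqr   # 28 - 21 = 7
upper_fence <- q3 + 1.5 * iqr   # 42 + 21 = 63

# The reported minimum and maximum ages.
ages       <- c(min = 1, max = 72)
is_outlier <- ages < lower_fence | ages > upper_fence
print(is_outlier)
```

Under this rule both the minimum (1) and the maximum (72) fall outside the fences, which agrees with the circles in the box plot.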

If US high school graduation is common at age 18, a Bachelor’s degree at 21, and a Master’s at 23, and the lowest quartile is 28, do most of these data science workers have a Ph.D.?

##             High School Some college/university              Bachelor's 
##                       4                      34                     301 
##                Master's               Doctorate     Professional Degree 
##                     551                     328                      19 
##                   Other 
##                       3

The above graph shows that the education of data science and machine learning workers has a fairly bell-shaped distribution centered at a Master’s degree. The largest group holds a Master’s degree (44%), followed by a Doctorate (26%) and then a Bachelor’s (24%). Master’s degrees are 1.68 times as common as Doctorates, 1.83 times as common as Bachelor’s degrees, and 29 times as common as Professional degrees.

This Bachelor’s : Master’s : Doctorate ratio is unusual given that data science is not a licensed field where people are required to obtain a certain degree, as medical doctors are. It is also interesting because, for ages 25 and over, the educational attainment in the United States is 8% higher for Bachelor’s, 32% lower for Master’s, and 24% lower for Doctorate when compared to the percentages for data science workers. Thus, the shape and distribution of the above graph do not match the shape and distribution of education for the United States.
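The percentages and ratios quoted above can be reproduced from the education counts in the table. This is a small sketch using those counts directly; the vector name `edu` is just a placeholder.

```r
# Counts copied from the formal education table above.
edu <- c("High School" = 4, "Some college/university" = 34, "Bachelor's" = 301,
         "Master's" = 551, "Doctorate" = 328, "Professional Degree" = 19,
         "Other" = 3)

# Percentage of respondents at each education level.
share <- round(100 * prop.table(edu), 1)
print(share)

# How many times more common a Master's is than selected other degrees.
ratio <- round(edu[["Master's"]] /
               edu[c("Doctorate", "Bachelor's", "Professional Degree")], 2)
print(ratio)
```

The counts sum to 1240, so each share is a fraction of all respondents in the US subset.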

Are data science and machine learning employees’ parents similar?

##             High School Some college/university              Bachelor's 
##                     156                     127                     324 
##                Master's               Doctorate     Professional Degree 
##                     321                     163                      82 
##                   Other                    NA's 
##                      33                      34

The parents are less educated than their children, and the graph does not exhibit the symmetry that the data scientist education graph did. Many parents aim for their children to be better off than they themselves were growing up, so these parents are probably happy that their children are more educated than they are.

What was the data scientists’ major?

##                                                              
##                                                          104 
##                                             A health science 
##                                                            9 
##                                      A humanities discipline 
##                                                           44 
##                                             A social science 
##                                                          106 
##                                                      Biology 
##                                                           50 
##                                             Computer Science 
##                                                          177 
##                                       Electrical Engineering 
##                                                          100 
##                           Engineering (non-computer focused) 
##                                                          135 
##                                 Fine arts or performing arts 
##                                                           11 
##                                     I never declared a major 
##                                                            0 
## Information technology, networking, or system administration 
##                                                           25 
##                               Management information systems 
##                                                           17 
##                                    Mathematics or statistics 
##                                                          244 
##                                                        Other 
##                                                           81 
##                                                      Physics 
##                                                          111 
##                                                   Psychology 
##                                                           26

The major graph shows that the plurality of data scientists and machine learning workers studied mathematics or statistics, followed by computer science, then engineering. Excluding blank responses, 76% had a STEM major (Health Science, Biology, Computer Science, Engineering, IT, Management Information Systems, Math, and Physics) and 24% did not (Humanities, Social Science, Fine Arts, Other, Psychology).
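The STEM share can be reproduced directly from the counts in the table above. The following is a minimal sketch in R; the variable names (`major_counts`, `stem`, `non_stem`) are illustrative and not taken from the report's code, and the blank category (104 responses) is left out on the assumption that it is what the text means by “NA.”

```r
# Counts copied from the major table above; the blank category (104) is excluded.
major_counts <- c(
  "A health science" = 9,
  "A humanities discipline" = 44,
  "A social science" = 106,
  "Biology" = 50,
  "Computer Science" = 177,
  "Electrical Engineering" = 100,
  "Engineering (non-computer focused)" = 135,
  "Fine arts or performing arts" = 11,
  "I never declared a major" = 0,
  "Information technology, networking, or system administration" = 25,
  "Management information systems" = 17,
  "Mathematics or statistics" = 244,
  "Other" = 81,
  "Physics" = 111,
  "Psychology" = 26
)

# Grouping used in the text: STEM versus non-STEM majors.
stem <- c(
  "A health science", "Biology", "Computer Science",
  "Electrical Engineering", "Engineering (non-computer focused)",
  "Information technology, networking, or system administration",
  "Management information systems", "Mathematics or statistics", "Physics"
)
non_stem <- c(
  "A humanities discipline", "A social science",
  "Fine arts or performing arts", "Other", "Psychology"
)

total      <- sum(major_counts[c(stem, non_stem)])
stem_share <- sum(major_counts[stem]) / total

# Percentages rounded to whole numbers, as quoted in the text.
round(100 * c(STEM = stem_share, non_STEM = 1 - stem_share))
```

With these counts, 868 of 1,136 declared majors fall in the STEM group.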

Another important characteristic for many people in choosing a profession is the compensation.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   76000  110000  120494  150000 2500000      27

After taking the square root of the compensation, an approximately normal distribution appears.

Have data scientists’ salaries been increasing, decreasing, or stagnant?

##                                                                   Other 
##                                                                      21 
##         I do not want to share information about my salary/compensation 
##                                                                      34 
##                                          I was not employed 3 years ago 
##                                                                     109 
##                                             I am not currently employed 
##                                                                       2 
##                                               Has decreased 20% or more 
##                                                                      17 
##                                        Has decreased between 6% and 19% 
##                                                                      15 
## Has stayed about the same (has not increased or decreased more than 5%) 
##                                                                     266 
##                                        Has increased between 6% and 19% 
##                                                                     328 
##                                               Has increased 20% or more 
##                                                                     443 
##                                                                    NA's 
##                                                                       5

The figure shows that salaries have been increasing for most data scientists and machine learners, and a plurality have seen their salary increase by 20% or more.

These workers make a lot more than the median and mean United States income. A logical next question is what their job titles are.

##                                                          Business Analyst 
##                                    0                                   54 
##                   Computer Scientist                         Data Analyst 
##                                   22                                  123 
##                           Data Miner                       Data Scientist 
##                                    7                                  419 
##                DBA/Database Engineer                             Engineer 
##                                   20                                   52 
##            Machine Learning Engineer     Operations Research Practitioner 
##                                   57                                   13 
##                                Other                   Predictive Modeler 
##                                  112                                   26 
##                           Programmer                           Researcher 
##                                   12                                   56 
##                 Scientist/Researcher Software Developer/Software Engineer 
##                                  122                                  108 
##                         Statistician 
##                                   37

The plurality said their title was “Data Scientist,” which is as expected since that is what the survey focused on. “Data Analyst” took the next highest spot by beating out “Scientist/Researcher” by one. The black horizontal line shows the median number of votes for a current job title, which is 52.

Does that title fit them well?

##    Poorly      Fine Perfectly      NA's 
##       159       806       254        21

Most think that the title fits them well.

Are they happy at their job?

## I prefer not to share                     1                     2 
##                    11                    33                    14 
##                     3                     4                     5 
##                    66                    57                   120 
##                     6                     7                     8 
##                   116                   259                   271 
##                     9                    10                  NA's 
##                   170                   122                     1

The column graph shows that most of them are happy. The plurality was at 8. All of the upper ratings meet or exceed the median number of votes per satisfaction level, which is marked by the gold line.

Finally, what do they spend their time at work doing?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   20.00   40.00   38.36   50.00  100.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   10.00   20.00   19.03   25.00   90.00       1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    5.00   10.29   15.00  100.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   10.00   13.92   20.00  100.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00   10.00   14.84   20.00  100.00       2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   2.579   0.000 100.000       5

Data scientists spend the largest share of their time at work gathering data, so how much data do they gather?

##  <1MB   1MB  10MB 100MB   1GB  10GB 100GB   1TB  10TB 100TB   1PB  10PB 
##    29    34    90   173   271   219   143    88    39    11     4     3 
## 100PB   1EB  >1EB  NA's 
##     1     0     1   134

How long have they been writing code?

## I don't write code to analyze data                   Less than a year 
##                                  1                                 40 
##                       1 to 2 years                       3 to 5 years 
##                                186                                379 
##                      6 to 10 years                 More than 10 years 
##                                260                                370 
##                               NA's 
##                                  4

Do they understand their algorithms well?

##                        Enough to run the code / standard library 
##                                                              123 
##                           Enough to tune the parameters properly 
##                                                               97 
##         Enough to explain the algorithm to someone non-technical 
##                                                              504 
##                   Enough to refine and innovate on the algorithm 
##                                                              152 
##   Enough to code it again from scratch, albeit it may run slowly 
##                                                              306 
## Enough to code it from scratch and have it be fast and efficient 
##                                                               51 
##                                                             NA's 
##                                                                7

What algorithms do they plan on learning or using in 2018?

##                                             
##                                          45 
##                           Anomaly Detection 
##                                          50 
##                           Association Rules 
##                                           2 
##                            Bayesian Methods 
##                                          85 
##                            Cluster Analysis 
##                                          28 
##                              Decision Trees 
##                                           8 
##                               Deep learning 
##                                         411 
##   Ensemble Methods (e.g. boosting, bagging) 
##                                          33 
##                             Factor Analysis 
##                                           4 
##           Genetic & Evolutionary Algorithms 
##                                          56 
## I don't plan on learning a new ML/DS method 
##                                          29 
##                               Link Analysis 
##                                           5 
##                                        MARS 
##                                           4 
##                         Monte Carlo Methods 
##                                          23 
##                                 Neural Nets 
##                                         175 
##                                       Other 
##                                          39 
##                      Proprietary Algorithms 
##                                          13 
##                              Random Forests 
##                                          14 
##                                  Regression 
##                                          12 
##                              Rule Induction 
##                                           6 
##                     Social Network Analysis 
##                                          39 
##               Support Vector Machines (SVM) 
##                                          23 
##                           Survival Analysis 
##                                          16 
##                                 Text Mining 
##                                          39 
##                        Time Series Analysis 
##                                          75 
##                             Uplift Modeling 
##                                           6

Note: If you do not know any of the above terms, check the appendix.

What machine learning tool do they plan to use in 2018?

##                                                    
##                                                 44 
##                            Amazon Machine Learning 
##                                                 56 
##                                Amazon Web services 
##                                                 39 
##                                             Angoss 
##                                                  0 
##                                              C/C++ 
##                                                 13 
##                                           Cloudera 
##                                                  2 
##                                          DataRobot 
##                                                 26 
##                                              Flume 
##                                                  1 
##                               Google Cloud Compute 
##                                                 33 
##                                    Hadoop/Hive/Pig 
##                                                 55 
##     I don't plan on learning a new tool/technology 
##                                                 44 
##                                         IBM Cognos 
##                                                  1 
##                                   IBM SPSS Modeler 
##                                                  4 
##                                IBM SPSS Statistics 
##                                                  2 
##                       IBM Watson / Waton Analytics 
##                                                 14 
##                                             Impala 
##                                                  5 
##                                               Java 
##                                                  9 
##                                              Julia 
##                                                 32 
##                                  Jupyter notebooks 
##                                                 37 
##                         KNIME (commercial version) 
##                                                  0 
##                               KNIME (free version) 
##                                                  1 
##                                        Mathematica 
##                                                  2 
##                                      MATLAB/Octave 
##                                                  1 
##                   Microsoft Azure Machine Learning 
##                                                 18 
##                        Microsoft Excel Data Mining 
##                                                  2 
## Microsoft R Server (Formerly Revolution Analytics) 
##                                                  8 
##                   Microsoft SQL Server Data Mining 
##                                                  3 
##                                            Minitab 
##                                                  0 
##                                              NoSQL 
##                                                 10 
##            Oracle Data Mining/ Oracle R Enterprise 
##                                                  1 
##                                             Orange 
##                                                  0 
##                                              Other 
##                                                 76 
##                                               Perl 
##                                                  1 
##                                             Python 
##                                                152 
##                                           QlikView 
##                                                  0 
##                                                  R 
##                                                 78 
##                    RapidMiner (commercial version) 
##                                                  2 
##                          RapidMiner (free version) 
##                                                  4 
##           Salfrod Systems CART/MARS/TreeNet/RF/SPM 
##                                                  0 
##           SAP BusinessObjects Predictive Analytics 
##                                                  1 
##                                           SAS Base 
##                                                  5 
##                               SAS Enterprise Miner 
##                                                  4 
##                                            SAS JMP 
##                                                  2 
##                                      Spark / MLlib 
##                                                109 
##                                                SQL 
##                                                 16 
##                                               Stan 
##                                                 14 
##          Statistica (Quest/Dell-formerly Statsoft) 
##                                                  0 
##                                            Tableau 
##                                                 24 
##                                         TensorFlow 
##                                                286 
##                                     TIBCO Spotfire 
##                                                  0 
##                                   Unix shell / awk 
##                                                  3 
##                                               Weka 
##                                                  0

Further Univariate Analysis

There are many features of interest in this smaller dataset and even more in the broader dataset. This report focuses on gender, age, formal education, recommended programming language, compensation, job satisfaction, and tools and methods to learn. The general statistics or counts of each of those variables were plotted in the univariate section. In the process of plotting that data, temporary new variables were created from existing variables in the dataset.

Four interesting insights from the univariate plots are:

  1. Computer Science was not the dominant major among data scientists and machine learners.

  2. The square root of compensation resulted in a normal distribution.

  3. A plurality of data scientists and machine learners say their salaries increased by at least 20% in the last three years.

  4. There is a mysterious decrease at 6 to 10 years of coding tenure. Is that decrease still prevalent when coding tenure is divided by gender?

The next segment investigates whether the statistics or distributions differ when variables are compared with or subdivided by other variables, for example whether the median age or compensation differs between males and females.

Bivariate Plots

Based on the univariate plots, there are many interesting possible relationships between variables. First, is there a relationship between compensation and age for data scientists?

## 
##  Pearson's product-moment correlation
## 
## data:  mc_responses_usa$Age and mc_responses_usa$CompensationAmount
## t = 7.3631, df = 1201, p-value = 0.0000000000003327
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1531058 0.2612764
## sample estimates:
##       cor 
## 0.2078264

Note: geom_jitter was used because age is a continuous variable but individuals only reported their age as an integer.

Since the correlation (r = 0.2078) is not greater than 0.3, the linear relationship between age and compensation is weak according to the Pearson method; however, this weak relationship is statistically significant, since its p-value is below 0.05.
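The test statistic in the output above follows from the correlation and the degrees of freedom alone. As a rough check, assuming the standard t-test for a Pearson correlation (which is what `cor.test` reports by default), the reported values can be reproduced as follows; the variable names are illustrative.

```r
# Values copied from the cor.test output above.
r  <- 0.2078264   # sample Pearson correlation
df <- 1201        # degrees of freedom (number of complete pairs minus 2)

# t statistic for H0: true correlation = 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with df degrees of freedom.
p_value <- 2 * pt(-abs(t_stat), df)

round(t_stat, 3)
signif(p_value, 3)
```

The large sample is what makes a weak correlation significant here: the same r with far fewer respondents would give a much smaller t statistic.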

##   Group.1        Age CompensationAmount
## 1       1  0.4471476         12.2090292
## 2       2  1.2254464          0.2941814
## 3       3 -0.5685766         -0.1955254

This is likely because compensation typically increases over time and age always increases over time, so the two variables tend to move in the same direction. However, being a certain age does not mean you will receive a certain income.

## mc_responses_usa$Gender: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   90000   90000   90000   90000   90000   90000 
## -------------------------------------------------------- 
## mc_responses_usa$Gender: A different identity
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   20000  115000  135000  115600  150000  158000       1 
## -------------------------------------------------------- 
## mc_responses_usa$Gender: Female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     185   65000   90000  100065  130000  330000       8 
## -------------------------------------------------------- 
## mc_responses_usa$Gender: Male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   80000  110000  125427  150000 2500000      18 
## -------------------------------------------------------- 
## mc_responses_usa$Gender: Non-binary, genderqueer, or gender non-conforming
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5000   39000   97300  110825  185000  235000

The gender-compensation plot shows that males have a higher median and average salary than females. They also have a higher maximum, third quartile, and first quartile. Since only six and eight respondents identified as “a different identity” or “non-binary, genderqueer, or gender non-conforming,” statistically meaningful conclusions cannot be drawn about their compensation.

## mc_responses_usa$RecommendedLanguage: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1   55000   83000   95931  135000  200000       1 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: C/C++/C#
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    5000   66000  104500  120125  160250  350000       1 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: F#
## NULL
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Haskell
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   40000   62500   85000   85000  107500  130000 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Java
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   78000   85000   86000  117000  132500  220000 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Julia
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  210000  407500  605000  605000  802500 1000000 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Matlab
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11500   94000  106000  117158  140000  240000 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Other
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   70000   82125   94000  117688  156250  200000 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Python
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   80000  110000  125335  150000 2500000      20 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: R
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   70000   97500  110730  135000  550000       4 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: SAS
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   65000   70000  100000  120385  140000  300000       1 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Scala
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  108000  122500  140000  140429  157500  175000 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: SQL
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   71500  108000  104527  130000  220000 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Stata
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22000   73000   95000   81500  103500  114000

The above plot shows that Python is the most recommended computer language, followed by R and then SQL. The salary of Python recommenders is higher than that of R recommenders.

It would be insightful to know whether compensation by language used differs significantly from compensation by recommended language. This Kaggle study did not track that. It is also possible that people would recommend Python and R equally, which Kaggle did not track. Lastly, the survey asked what respondents “recommend,” so it is possible that a data scientist recommended something they do not use.

Since “blank” or “no language” had 30 respondents, 30 was used as the minimum number of responses required to provide useful data. This left only Python (783 respondents), R (283), and SQL (64) with enough respondents to draw insights.

Out of these three languages, the largest median income per recommended language is Python (110,000), then SQL (108,000), then R (97,500). By mean income, the order is Python (125,335), then R (110,730), then SQL (104,527).
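The two rankings above can be checked from the summary figures already reported. The sketch below is only an illustration: the data frame `lang` and its column names are hypothetical, and the values are copied from the text and the summaries above rather than recomputed from the survey data.

```r
# Respondent counts, medians, and means for the three languages above the cut-off.
lang <- data.frame(
  language = c("Python", "R", "SQL"),
  n        = c(783, 283, 64),
  median   = c(110000, 97500, 108000),
  mean     = c(125335, 110730, 104527),
  stringsAsFactors = FALSE
)

min_n <- 30                      # cut-off used in the text
lang  <- lang[lang$n > min_n, ]  # all three languages pass the cut-off

# Languages ordered from highest to lowest compensation.
by_median <- lang$language[order(-lang$median)]
by_mean   <- lang$language[order(-lang$mean)]

by_median
by_mean
```

R and SQL swap places between the two rankings, which suggests a few high earners pull up the R mean.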

What’s the breakdown of education by job satisfaction?

Is this breakdown similar with parents’ education?

## mc_responses_usa$ParentsEducation: High School
##             High School Some college/university              Bachelor's 
##                       1                       9                      37 
##                Master's               Doctorate     Professional Degree 
##                      66                      41                       2 
##                   Other 
##                       0 
## -------------------------------------------------------- 
## mc_responses_usa$ParentsEducation: Some college/university
##             High School Some college/university              Bachelor's 
##                       0                       9                      30 
##                Master's               Doctorate     Professional Degree 
##                      58                      26                       4 
##                   Other 
##                       0 
## -------------------------------------------------------- 
## mc_responses_usa$ParentsEducation: Bachelor's
##             High School Some college/university              Bachelor's 
##                       1                       5                      75 
##                Master's               Doctorate     Professional Degree 
##                     155                      84                       2 
##                   Other 
##                       2 
## -------------------------------------------------------- 
## mc_responses_usa$ParentsEducation: Master's
##             High School Some college/university              Bachelor's 
##                       1                       3                      82 
##                Master's               Doctorate     Professional Degree 
##                     152                      80                       3 
##                   Other 
##                       0 
## -------------------------------------------------------- 
## mc_responses_usa$ParentsEducation: Doctorate
##             High School Some college/university              Bachelor's 
##                       0                       1                      41 
##                Master's               Doctorate     Professional Degree 
##                      65                      55                       1 
##                   Other 
##                       0 
## -------------------------------------------------------- 
## mc_responses_usa$ParentsEducation: Professional Degree
##             High School Some college/university              Bachelor's 
##                       0                       2                      20 
##                Master's               Doctorate     Professional Degree 
##                      30                      25                       5 
##                   Other 
##                       0 
## -------------------------------------------------------- 
## mc_responses_usa$ParentsEducation: Other
##             High School Some college/university              Bachelor's 
##                       1                       3                       9 
##                Master's               Doctorate     Professional Degree 
##                      10                       7                       2 
##                   Other 
##                       1

Regarding the above correlogram, the Spearman method was used because the relationships in the data were assumed to be non-linear.

Therefore, the null hypothesis is that the Spearman correlation coefficient, rho, is 0. A rho of 0 means that the ranks of one variable do not covary with the ranks of the other variable. In other words, as the ranks of one variable increase, the ranks of the other variable do not increase or decrease. Also, in the graph, a small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis.
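To illustrate why a rank-based method suits non-linear relationships, the following minimal sketch compares the Pearson and Spearman coefficients on a small made-up example. The vectors `x` and `y` are hypothetical and unrelated to the survey data. It uses `cor` and `cor.test` from base R's stats package.

```r
# Toy data: y increases with x, but along a curve rather than a line.
x <- 1:10
y <- x^3

# Pearson measures linear association, so it falls short of 1 here.
pearson <- cor(x, y, method = "pearson")

# Spearman works on ranks; the ranks of x and y agree exactly.
spearman <- cor.test(x, y, method = "spearman")

pearson
spearman$estimate   # rho
spearman$p.value    # tests the null hypothesis rho = 0
```

Because the ranks of `x` and `y` match perfectly, rho equals 1 and the null hypothesis of rho = 0 is rejected.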

Excluding the forced correlations with the time spent doing different tasks at work, some logical insights from the correlation matrix are:

  1. Formal education and age are positively correlated as expected (rho = 0.272, p < .001).

  2. Coding tenure and age have the strongest correlation at 0.57, followed by tenure and compensation amount at 0.42, then compensation amount and age at 0.391, and then coding tenure and education at 0.301.

  3. It is not unexpected that work dataset size and the machine learning tool to learn next year correlate (rho = 0.069, p = .028), as people’s decision on what tool to learn could be affected by how much data they have to process. Similar logic can be applied to recommended language and work dataset size (rho = -0.095, p = .002).

  4. Time spent model building is correlated with coding tenure, major, formal education, and recommended language next year, as might be expected. (It also showed a correlation with title fit, which is not expected.)

  5. Recommended language is correlated with all the work time variables except “time other” and “time gathering data.”

  6. Job satisfaction is correlated with job title fit, salary change, and compensation amount.

  7. Salary change is correlated with coding tenure and compensation amount.

  8. Lastly, there is also a correlation between compensation and: age, gender, education (but not parents’ education), coding tenure, and algorithm understanding level.

There were more variable correlations, but many began to stretch what most people would consider justified. For instance, coding tenure and work dataset size were correlated. An argument could be made that the better coders have coded longer, i.e., have a longer coding tenure. Since they are better coders, they are assigned or choose to work with larger datasets at work. However, others might say that work dataset size is independent of a coder’s tenure. The dataset size is more contingent on how many people use the service that is collecting the data, how many respond to a specific survey, or the dataset bought by the coder’s employer.

Further Bivariate Analysis

The box plot, before adjusting the y-axis, showed outliers near 100 years old for “a different identity,” “male,” and “non-binary, genderqueer, or gender non-conforming.” There was no outlier at this high age range for females and individuals who did not specify a gender. This highlights two of the most difficult data collection issues: complete answers and the integrity of the answers. Since ages near 100 are outliers, it is doubtful that these individuals are really 100 years old rather than having entered an incorrect age.

The correlation matrix displays the pairwise correlations between variables. Excluding time-dependent variables (age, coding tenure, work time tasks), the strongest relationship was between salary change and compensation amount at 0.231, then job satisfaction and title fit at 0.221, and finally, the unexpected correlation between compensation amount and work dataset size at 0.176.

The statistically significant correlation between compensation amount and gender is consistent with the gender pay inequality that is sometimes reported in the news. Fortunately, it is only a weak correlation of 0.129.

Multivariate Plots

The densest section, around a data scientist education of Master’s and a job title fit of “fine,” shows that the plurality of respondents had a Master’s degree, as seen in an earlier graph. The lack of a clear grouping of color at any data scientist education level suggests that there is no correlation between parents’ education and job satisfaction or between parents’ education and the data scientist’s education. The few points near the low job satisfaction scores mean that most respondents were satisfied with their job. Lastly, the grouping near 10 and “perfectly” for Doctorate and the grouping near 1 and “poorly” for Master’s suggest that job title fit and job satisfaction could be correlated, which the correlation matrix confirms.

Further Multivariate Analysis

A surprising insight was the lack of influence of parents’ education on other variables. It was only weakly correlated with age and title fit. It was expected that parents’ education would also be correlated with the education of their offspring, the data scientists.

A linear regression model and a K-means cluster model were created; both are in the “Final Plots and Summary” section.

Final Plots and Summary

The insights drawn from the numerous plots and statistics of this public dataset can aid employees, employers, and others by helping people determine what to learn, what to include in job requirements, what to title job positions, and at what salary level to begin compensation packages. To continue emphasizing key information about the data science and machine learning industry, two multivariate plots and a cluster analysis are highlighted in the summary.

Lastly, some tips drawn from the data insights are:

  1. Learn Python, R, and SQL as they are the languages most used by data scientists.

  2. Learn deep learning and neural nets as they will be the most sought-after techniques in the future.

  3. Develop skills for gathering data as it can be the most time-consuming process in the workflow of a data scientist.

  4. Statistics and mathematics are vital to understanding how certain algorithms work.

Immutable Attributes

Age, gender, and compensation cannot easily be changed by an employee. Compensation is not considered easy to change because it is determined by the employer.

The above graph shows the dominance of male (blue) over female (light red) responses in the survey. It also makes some interesting outliers evident, such as the female who is around 70 years old but one of the lowest paid, or the 25-year-old male who is one of the highest paid.

K-Means Clustering

To analyze the point variability of two of these immutable attributes, a K-Means cluster analysis was conducted.

A k of 3 was chosen because cluster groupings of 2 and 4 explained a smaller percent of point variability.
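A minimal sketch of how the percent of point variability could be compared across values of k is shown below. It assumes the two clustered attributes are age and compensation, uses simulated data in place of mc_responses_usa, and scales both columns first; none of these details are confirmed by the report.

```r
# Simulated stand-in for two attributes (assumed here to be age and compensation).
set.seed(1)
toy <- data.frame(
  Age = c(rnorm(50, 28, 3), rnorm(50, 40, 4), rnorm(50, 55, 5)),
  CompensationAmount = c(rnorm(50, 80000, 10000),
                         rnorm(50, 130000, 15000),
                         rnorm(50, 180000, 20000))
)

# Scale so that compensation (large units) does not dominate the distances.
scaled <- scale(toy)

# Share of total variability explained by the clustering: between_SS / total_SS.
explained <- sapply(2:4, function(k) {
  fit <- kmeans(scaled, centers = k, nstart = 25)
  fit$betweenss / fit$totss
})
names(explained) <- paste0("k=", 2:4)
print(round(100 * explained, 1))
```

Because this ratio tends to rise as k grows, it is usually read alongside an elbow plot, looking for the k after which the gains flatten.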

Mutable Attributes

The level at which an employee understands algorithms and the programming tools or languages that an employee recommends can easily be changed by that employee.

## 
## Call:
## lm(formula = MLMnum ~ AlULnum)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.278  -4.278  -4.097   3.813  14.903 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 11.05200    0.55126  20.049 <0.0000000000000002 ***
## AlULnum      0.04514    0.14861   0.304               0.761    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.902 on 1231 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  7.494e-05,  Adjusted R-squared:  -0.0007373 
## F-statistic: 0.09226 on 1 and 1231 DF,  p-value: 0.7614
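To read the output above: the slope for AlULnum is 0.045 with a p-value of 0.761, and the R-squared is essentially zero, so this model shows no linear relationship between the two variables. The sketch below shows how those same quantities could be pulled out of a fitted model in R. The simulated vectors are placeholders, and treating MLMnum and AlULnum as numeric codes of MLMethodNextYear and AlgorithmUnderstandingLevel is an assumption based on their names.

```r
# Simulated placeholders for the two numeric-coded variables (not the survey data).
set.seed(7)
n <- 300
AlULnum <- sample(1:6, n, replace = TRUE)   # assumed: understanding level codes
MLMnum  <- sample(1:26, n, replace = TRUE)  # assumed: ML method level codes

fit <- lm(MLMnum ~ AlULnum)
s <- summary(fit)

# Slope p-value and R-squared, the two numbers that drive the interpretation.
slope_p <- coef(s)["AlULnum", "Pr(>|t|)"]
r2 <- s$r.squared
print(c(slope_p = slope_p, r_squared = r2))
```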

The chart above shows that within Python, the most popularly recommended language among data scientists and machine learners, the group that also selected “Deep Learning” contains the most people who know enough to code algorithms from scratch and have them be fast and efficient.

Reflection

The US data science and machine learning dataset contains information on 1240 respondents across 22 variables from 2017. I started by understanding the individual variables in the dataset in the univariate section, and then I explored exciting leads as I continued to make observations on plots, apply statistical concepts, and describe relationships in the data. Eventually, I explored compensation across many variables and created a K-means cluster graph.

There were never strong trends between variables, but there were many statistically significant correlations. I was also surprised that algorithm understanding level correlated with the most variables: 12 out of 21 possible variables. The next closest was job satisfaction, which correlated with nine other variables.

In future work, data science could benefit from learning the job search duration. Job hunt time was included in the Kaggle survey, but the question was not answered by respondents in the United States. Additionally, a breakdown of data scientists by race/ethnicity and sexual orientation could provide interesting insight.

The respondents of Kaggle’s survey were most likely Kaggle users, which means the resultant dataset is biased towards that type of respondent. It is unknown whether data scientists who use Kaggle are significantly different from data scientists who do not.

Appendix

Terminology

Association Rules - a rule-based machine learning method for discovering interesting relations between variables in large databases.[3]

Bayesian Methods - a statistical inference method in which Bayes’ theorem is used to update the probability of a hypothesis as more data becomes available.

Convolutional Neural Network - a class of feed-forward neural networks comprised of one or more convolutional layers and then followed by one or more fully connected layers that have successfully been utilized for analyzing imagery.

Cluster Analysis - grouping a set of items in such a way that objects in the same cluster are more similar (in some sense) to each other than to those in other clusters.

Collaborative Filtering - a technique used by recommender systems that predicts a user’s interests from the preferences of many other users.

Cross-Validation - a model validation method for assessing how the outcomes of a statistical analysis will generalize to an independent dataset.

Decision Trees - a decision support tool that uses a tree-like graph or model of decisions and their possible outcomes, including resource costs, utility, and chance event outcomes.

Deep Learning - a subset of machine learning methods based on learning data representations, as opposed to task-specific algorithms.

Dimensional Modeling - a set of concepts and techniques used in data warehouse design.

Ensemble Methods (e.g., boosting, bagging) - use of diverse learning algorithms to achieve better predictive performance than could be gained from any of the constituent learning algorithms alone.

Factor Analysis - a statistical technique to describe variability among observed, correlated variables concerning a possibly lower number of unobserved variables named factors. For instance, it is possible that variations in eight observed variables mainly reflect the changes in two underlying, unobserved variables.

Genetic and Evolutionary Algorithms - an evolutionary algorithm (EA) is a generic population-based metaheuristic optimization algorithm that uses mechanisms inspired by biological evolution, such as mutation, reproduction, selection, and recombination. A genetic algorithm is one class of evolutionary algorithm; although genetic algorithms are frequently encountered, there are other types.

Link Analysis - a data-analysis technique used to evaluate relationships between nodes. Relationships may be identified among various types of nodes, including organizations, people, and transactions.

Multivariate Adaptive Regression Splines - a non-parametric regression technique that automatically models nonlinearities and interactions between variables.

Monte Carlo Methods - a set of computational algorithms that depend on repeated random sampling to obtain numerical results.

Neural Nets - a system of data structures and programs that approximates the operation of the human brain. A neural network ordinarily involves many processors operating in parallel.

Principal Components Analysis - a statistical operation that uses an orthogonal transformation to convert a batch of potentially correlated variables into a group of linearly uncorrelated variables termed principal components.

Random Forests - an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

Linear Regression - a linear approach for modeling the relationship between one or more explanatory variables and a scalar dependent variable.

Rule Induction - an area of machine learning in which formal rules are extracted from a set of observations. The rules obtained may represent a full scientific model of the data, or merely local patterns in the data.

Social Network Analysis - a process of examining social structures through the use of networks and graph theory. It characterizes networked structures as nodes (individual actors, things, or people within the system) and the ties, edges, or links (interactions or relationships) that connect them.

Support Vector Machines (SVM) - a discriminative classifier formally defined by a separating hyperplane; i.e., given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.

Survival Analysis - a division of statistics for analyzing the expected time until one or more events happen.

Text Mining - the process of deriving high-quality information from text typically by devising trends or patterns through methods like statistical pattern learning.

Time Series - a sequence of data points indexed in time order. Since a time series is commonly a sequence taken at successive equally spaced intervals, it is a sequence of discrete-time data.

Uplift Modeling - a predictive modeling method that directly illustrates the incremental impact of a treatment, such as a marketing action, on an actor’s behavior.

Works Cited

  1. Kaggle. “Kaggle ML and Data Science Survey, 2017.” Kaggle, 2017. Web. 12 January 2018. https://www.kaggle.com/kaggle/kaggle-survey-2017/.

  2. The percentages, which are calculated based on Census data by counting people that had attained that level or higher, add up to more than 100% because they are cumulative. For example, it is assumed that all people with doctorates also have undergraduate and high school degrees, and are thus counted twice in the “lower” categories. “Educational Attainment in the United States: 2014”. U.S. Census Bureau. Retrieved January 29, 2015.

  3. Piatetsky-Shapiro, Gregory (1991), Discovery, analysis, and presentation of strong rules, in Piatetsky-Shapiro, Gregory; and Frawley, William J.; eds., Knowledge Discovery in Databases, AAAI/MIT Press, Cambridge, MA.

United States Data Science and Machine Learning Employees

Introduction

Data science is an increasingly popular field. Typical job titles related to this field include business analyst, data analyst, data scientist, decision scientist, and data engineer. The graph below shows the interest over time in popular terms relating to data science. The numbers on the graph represent search interest relative to the highest point on the chart for the given time. A value of 100 is the peak popularity of the term. A value of 50 means that the term is half as popular. Similarly, a score of 0 means the term was less than 1% as popular as the peak. The graph’s data is worldwide Google search data from January 2014 to January 2018.

The term graph shows the increasing popularity of the terms data science and machine learning. There is also a downward trend in the usage of business intelligence. More data science and machine learning terms will be mentioned in this report.

This report begins with a general description of the dataset and then conducts exploratory data analysis. The analysis is divided by the number of variables analyzed. The goals of the report are to convey the insights from the data science dataset and to show knowledge of data science techniques and tools including R, RMD, statistics, and others. For data science or machine learning terms a reader does not know, check the appendix to learn the definition or to read a concise description.

Dataset

In 2017, Kaggle conducted an industry-wide survey to establish a comprehensive view of data science and machine learning. The resultant multiple choice dataset has 16,716 usable observations/responses of 228 variables from 171 countries and territories. The survey was live for 18 days. The dataset can show who is working with data, what’s happening at the cutting edge of machine learning, and how new data scientists can best break into the field.[1]

Not every respondent saw every question. In an attempt to ask relevant questions of each respondent, Kaggle asked work-related questions of employed data scientists and learning-related questions of students. There is a column in the schema.csv file called “Asked” that describes who saw each question. You can learn more about the different segments used in the schema.csv file and RespondentTypeREADME.txt in the data tab.

While the dataset includes multiple files and countries, this report will focus on the cleaner dataset file multipleChoiceResponses.csv and respondents from the United States paid in USD. The final “mc_responses_usa” dataset includes 1240 responses of 22 variables. Since the United States is a net immigration country, meaning more people are moving to the U.S. than leaving it, it is likely that the data scientists who responded to the survey will stay in the U.S.

After limiting the large Kaggle dataset to a smaller one with only the variables that will be used, the dataset needed to be cleaned and manipulated. Some variables, like education, had their responses reworded to maintain consistency. Many variables were also converted to ordered factors, and the compensation was converted to a numeric. This was done so that the report was cleaner, summary statistics could be calculated, and the appropriate charts could be formed.
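The cleaning steps described above could look roughly like the following sketch. The raw response strings, the relabel lookup, and the three-level subset of education are hypothetical examples; only the column names FormalEducation and CompensationAmount come from the dataset structure in this report.

```r
# Hypothetical raw responses (the real survey strings may differ).
raw <- data.frame(
  FormalEducation = c("Master's degree", "Bachelor's degree",
                      "Doctoral degree", "Master's degree"),
  CompensationAmount = c("250,000", "20000", NA, "133,000"),
  stringsAsFactors = FALSE
)

# Reword responses for consistency, then convert to an ordered factor.
relabel <- c("Bachelor's degree" = "Bachelor's",
             "Master's degree"   = "Master's",
             "Doctoral degree"   = "Doctorate")
raw$FormalEducation <- factor(unname(relabel[raw$FormalEducation]),
                              levels = c("Bachelor's", "Master's", "Doctorate"),
                              ordered = TRUE)

# Strip thousands separators so compensation can be treated as a number.
raw$CompensationAmount <- as.numeric(gsub(",", "", raw$CompensationAmount))

str(raw)
summary(raw$CompensationAmount)
```

Ordered factors keep the level order in tables and plots and allow comparisons such as raw$FormalEducation >= "Master's".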

The structure of the new dataset is shown below:

str(mc_responses_usa)
## 'data.frame':    1240 obs. of  22 variables:
##  $ Gender                     : Factor w/ 5 levels "","A different identity",..: 4 4 4 4 4 3 3 3 4 4 ...
##  $ Age                        : int  56 25 33 35 40 31 39 30 50 59 ...
##  $ CurrentJobTitle            : Factor w/ 17 levels "","Business Analyst",..: 10 14 15 8 15 15 12 2 12 6 ...
##  $ TitleFit                   : Ord.factor w/ 3 levels "Poorly"<"Fine"<..: 1 2 3 2 3 3 2 2 3 2 ...
##  $ MLToolNextYear             : Factor w/ 52 levels "","Amazon Machine Learning",..: 49 2 36 44 49 3 36 19 34 24 ...
##  $ MLMethodNextYear           : Factor w/ 26 levels "","Anomaly Detection",..: 15 7 7 7 7 8 4 15 24 19 ...
##  $ RecommendedLanguage        : Factor w/ 14 levels "","C/C++/C#",..: 9 9 7 9 9 10 10 9 10 9 ...
##  $ FormalEducation            : Ord.factor w/ 7 levels "High School"<..: 4 3 5 5 5 5 4 3 5 4 ...
##  $ Major                      : Factor w/ 16 levels "","A health science",..: 13 15 7 15 15 8 8 1 13 15 ...
##  $ Tenure                     : Ord.factor w/ 6 levels "I don't write code to analyze data"<..: 6 4 2 4 6 5 3 3 6 3 ...
##  $ ParentsEducation           : Ord.factor w/ 7 levels "High School"<..: 1 4 3 5 NA 4 3 3 1 3 ...
##  $ WorkDatasetSize            : Ord.factor w/ 15 levels "<1MB"<"1MB"<"10MB"<..: 5 5 7 3 7 5 4 NA 4 5 ...
##  $ TimeGatheringData          : int  50 0 0 30 60 30 80 60 20 60 ...
##  $ TimeModelBuilding          : num  20 80 0 20 20 30 10 10 25 10 ...
##  $ TimeProduction             : num  0 0 0 5 0 10 5 10 10 10 ...
##  $ TimeVisualizing            : num  10 20 0 15 20 10 5 10 25 10 ...
##  $ TimeFindingInsights        : num  20 0 0 30 0 20 0 10 20 10 ...
##  $ TimeOther                  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AlgorithmUnderstandingLevel: Ord.factor w/ 6 levels "Enough to run the code / standard library"<..: 4 4 2 3 5 3 3 3 4 3 ...
##  $ CompensationAmount         : num  250000 20000 100000 133000 215000 83500 115000 80000 135000 75000 ...
##  $ SalaryChange               : Ord.factor w/ 9 levels "Other"<"I do not want to share information about my salary/compensation"<..: 9 7 9 8 9 9 3 8 8 8 ...
##  $ JobSatisfaction            : Ord.factor w/ 11 levels "I prefer not to share"<..: 11 7 8 9 9 8 11 11 9 9 ...

Univariate Plots

In this section, preliminary exploration of the dataset occurs, along with summaries of the data and univariate plots to understand the structure of the individual variables.

The first question that drove exploration of this data is, “What age is a typical data science worker?” The box plot and summary statistics show that the median age is 33 and the mean is 36. This shows that there is a slight upward skew. In the box plot, the red dot marks the average age. The thick black line marks the median age. The box spans the middle 50% of ages: its bottom edge marks the first quartile (28 years old), and its top edge marks the third quartile (42 years old). The circles mark outliers. For example, there are outliers at 1 and 72, which are the min and max. It is likely that 1 is a fake age because it is improbable that a toddler works in data science.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   28.00   33.00   36.34   42.00   72.00      10
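The outlier circles follow from the usual box plot rule that points more than 1.5 times the interquartile range beyond the box are drawn individually. The sketch below works through that arithmetic with the quartiles printed above; it assumes the plot used R’s default whisker coefficient of 1.5, and the variable names are my own.

```r
# Quartiles taken from the summary output above.
q1 <- 28
q3 <- 42
iqr <- q3 - q1                    # 14

# Fences under the default 1.5 * IQR rule (assumed, not stated in the report).
lower_fence <- q1 - 1.5 * iqr     # 7
upper_fence <- q3 + 1.5 * iqr     # 63

# The reported min and max both fall outside the fences.
ages <- c(min = 1, max = 72)
is_outlier <- ages < lower_fence | ages > upper_fence
print(is_outlier)
```

R’s boxplot() computes its box from hinges, which can differ slightly from the quartiles reported by summary(), so the exact fences in the plot may be a little different.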

If US high school graduation is common at age 18, Bachelor’s graduation at 21, and Master’s at 23, and the lowest quartile is 28, do most of these data science workers have a Ph.D.?

##             High School Some college/university              Bachelor's 
##                       4                      34                     301 
##                Master's               Doctorate     Professional Degree 
##                     551                     328                      19 
##                   Other 
##                       3

The above graph shows that the education of data science and machine learning workers has a fairly bell-shaped distribution centered at a Master’s degree. The plurality of workers have a Master’s degree (44%), followed by a Doctorate (26%) and then a Bachelor’s (24%). Master’s degrees are 1.68x as common as Doctorates, 1.83x as common as Bachelor’s degrees, and 29x as common as Professional degrees.
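The percentages and ratios in this paragraph can be reproduced from the education counts printed above. In this sketch the counts are copied from that table, while the vector names edu, share, and ratio are my own.

```r
# Counts copied from the education table above.
edu <- c("High School" = 4, "Some college/university" = 34, "Bachelor's" = 301,
         "Master's" = 551, "Doctorate" = 328, "Professional Degree" = 19,
         "Other" = 3)

# Percent of all 1240 respondents at each level.
share <- round(100 * edu / sum(edu))
print(share[c("Master's", "Doctorate", "Bachelor's")])

# How many times more common a Master's is than the other degrees.
ratio <- round(edu["Master's"] /
                 edu[c("Doctorate", "Bachelor's", "Professional Degree")], 2)
print(ratio)
```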

This Bachelor’s : Master’s : Doctorate ratio is unusual given that data science is not a licensed field where people are required to obtain a certain degree, as medical doctors are. It is also interesting because for ages 25 and over, the educational attainment in the United States is 8% higher for Bachelor’s, 32% lower for Master’s, and 24% lower for Doctorate when compared to the percentages for data science workers. Thus, the shape and distribution of the above graph do not match the shape and distribution of education for the United States.

Are data science and machine learning employees’ parents similar?

##             High School Some college/university              Bachelor's 
##                     156                     127                     324 
##                Master's               Doctorate     Professional Degree 
##                     321                     163                      82 
##                   Other                    NA's 
##                      33                      34

The parents are less educated than their children, and the graph does not exhibit the same symmetry as the data scientist education graph did. It is a goal of many parents to have their children be better off than the parents were growing up, so these parents are probably happy that their child is more educated than they are.

What was the data scientists’ major?

##                                                              
##                                                          104 
##                                             A health science 
##                                                            9 
##                                      A humanities discipline 
##                                                           44 
##                                             A social science 
##                                                          106 
##                                                      Biology 
##                                                           50 
##                                             Computer Science 
##                                                          177 
##                                       Electrical Engineering 
##                                                          100 
##                           Engineering (non-computer focused) 
##                                                          135 
##                                 Fine arts or performing arts 
##                                                           11 
##                                     I never declared a major 
##                                                            0 
## Information technology, networking, or system administration 
##                                                           25 
##                               Management information systems 
##                                                           17 
##                                    Mathematics or statistics 
##                                                          244 
##                                                        Other 
##                                                           81 
##                                                      Physics 
##                                                          111 
##                                                   Psychology 
##                                                           26

The major graph shows that the plurality of data scientists and machine learning workers studied mathematics or statistics, followed by computer science and then engineering. Excluding blank responses (“NA”), 76% had a STEM major (Health Science, Biology, Computer Science, Engineering, IT, Management Information Systems, Math, and Physics) and 24% did not (Humanities, Social Science, Fine Arts, Other, Psychology).

Another important characteristic for many people in choosing a profession is the compensation.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   76000  110000  120494  150000 2500000      27

When taking the square root of the compensation, a normal distribution appears.
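A minimal sketch of why a square-root transformation helps with right-skewed values such as compensation is shown below. The data are simulated from a gamma distribution purely for illustration, and the skew() helper is my own; neither comes from the survey.

```r
# Simulated right-skewed "compensation" values (not the survey data).
set.seed(3)
comp <- rgamma(1000, shape = 4, scale = 30000)

# Simple moment-based skewness; 0 means symmetric.
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw  <- skew(comp)
skew_sqrt <- skew(sqrt(comp))
print(c(raw = skew_raw, sqrt = skew_sqrt))

# Side-by-side histograms make the change in shape visible.
op <- par(mfrow = c(1, 2))
hist(comp, main = "Raw", xlab = "Compensation")
hist(sqrt(comp), main = "Square root", xlab = "sqrt(Compensation)")
par(op)
```

The square root compresses large values more than small ones, which pulls in the long right tail and leaves a more symmetric shape.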

Have data scientists’ salaries been increasing, decreasing, or stagnant?

##                                                                   Other 
##                                                                      21 
##         I do not want to share information about my salary/compensation 
##                                                                      34 
##                                          I was not employed 3 years ago 
##                                                                     109 
##                                             I am not currently employed 
##                                                                       2 
##                                               Has decreased 20% or more 
##                                                                      17 
##                                        Has decreased between 6% and 19% 
##                                                                      15 
## Has stayed about the same (has not increased or decreased more than 5%) 
##                                                                     266 
##                                        Has increased between 6% and 19% 
##                                                                     328 
##                                               Has increased 20% or more 
##                                                                     443 
##                                                                    NA's 
##                                                                       5

The figure shows that salaries have been increasing for most data scientists and machine learners, and the plurality of them have seen theirs increase by 20% or more.

These workers make a lot more than the median and mean United States income. A logical next question is what their job titles are.

##                                                          Business Analyst 
##                                    0                                   54 
##                   Computer Scientist                         Data Analyst 
##                                   22                                  123 
##                           Data Miner                       Data Scientist 
##                                    7                                  419 
##                DBA/Database Engineer                             Engineer 
##                                   20                                   52 
##            Machine Learning Engineer     Operations Research Practitioner 
##                                   57                                   13 
##                                Other                   Predictive Modeler 
##                                  112                                   26 
##                           Programmer                           Researcher 
##                                   12                                   56 
##                 Scientist/Researcher Software Developer/Software Engineer 
##                                  122                                  108 
##                         Statistician 
##                                   37

The plurality said their title was “Data Scientist,” which is as expected since that is what the survey focused on. “Data Analyst” took the next highest spot by beating out “Scientist/Researcher” by one. The black horizontal line shows the median number of votes for a current job title, which is 52.
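The median of 52 can be checked directly from the counts printed above. In this sketch the vector name title_counts and the “(blank)” label for the empty title level are my own; the counts are copied from the table. Note that 52 is reproduced only when the empty level with 0 responses is included; dropping it gives 53.

```r
# Counts copied from the job title table above; "(blank)" is the empty level.
title_counts <- c("(blank)" = 0, "Business Analyst" = 54,
                  "Computer Scientist" = 22, "Data Analyst" = 123,
                  "Data Miner" = 7, "Data Scientist" = 419,
                  "DBA/Database Engineer" = 20, "Engineer" = 52,
                  "Machine Learning Engineer" = 57,
                  "Operations Research Practitioner" = 13,
                  "Other" = 112, "Predictive Modeler" = 26,
                  "Programmer" = 12, "Researcher" = 56,
                  "Scientist/Researcher" = 122,
                  "Software Developer/Software Engineer" = 108,
                  "Statistician" = 37)

# Median across all 17 factor levels, including the empty one.
median_all <- median(title_counts)

# Median across the 16 named titles only.
median_named <- median(title_counts[names(title_counts) != "(blank)"])

print(c(all_levels = median_all, named_only = median_named))
```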

Does that title fit them well?

##    Poorly      Fine Perfectly      NA's 
##       159       806       254        21

Most think that the title fits them well.

Are they happy at their job?

## I prefer not to share                     1                     2 
##                    11                    33                    14 
##                     3                     4                     5 
##                    66                    57                   120 
##                     6                     7                     8 
##                   116                   259                   271 
##                     9                    10                  NA's 
##                   170                   122                     1

The column graph shows that most of them are happy. The plurality was at 8. All of the upper numbers meet or exceed the median number of votes per satisfaction level, which is marked by the gold line.

Finally, what do they spend their time at work doing?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   20.00   40.00   38.36   50.00  100.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   10.00   20.00   19.03   25.00   90.00       1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    5.00   10.29   15.00  100.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   10.00   13.92   20.00  100.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00   10.00   14.84   20.00  100.00       2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   2.579   0.000 100.000       5

Data scientists spend most of their time at work gathering data, so how much data do they gather?

##  <1MB   1MB  10MB 100MB   1GB  10GB 100GB   1TB  10TB 100TB   1PB  10PB 
##    29    34    90   173   271   219   143    88    39    11     4     3 
## 100PB   1EB  >1EB  NA's 
##     1     0     1   134

How long have they been writing code?

## I don't write code to analyze data                   Less than a year 
##                                  1                                 40 
##                       1 to 2 years                       3 to 5 years 
##                                186                                379 
##                      6 to 10 years                 More than 10 years 
##                                260                                370 
##                               NA's 
##                                  4

Do they understand their algorithms well?

##                        Enough to run the code / standard library 
##                                                              123 
##                           Enough to tune the parameters properly 
##                                                               97 
##         Enough to explain the algorithm to someone non-technical 
##                                                              504 
##                   Enough to refine and innovate on the algorithm 
##                                                              152 
##   Enough to code it again from scratch, albeit it may run slowly 
##                                                              306 
## Enough to code it from scratch and have it be fast and efficient 
##                                                               51 
##                                                             NA's 
##                                                                7

What algorithms do they plan on learning or using in 2018?

##                                             
##                                          45 
##                           Anomaly Detection 
##                                          50 
##                           Association Rules 
##                                           2 
##                            Bayesian Methods 
##                                          85 
##                            Cluster Analysis 
##                                          28 
##                              Decision Trees 
##                                           8 
##                               Deep learning 
##                                         411 
##   Ensemble Methods (e.g. boosting, bagging) 
##                                          33 
##                             Factor Analysis 
##                                           4 
##           Genetic & Evolutionary Algorithms 
##                                          56 
## I don't plan on learning a new ML/DS method 
##                                          29 
##                               Link Analysis 
##                                           5 
##                                        MARS 
##                                           4 
##                         Monte Carlo Methods 
##                                          23 
##                                 Neural Nets 
##                                         175 
##                                       Other 
##                                          39 
##                      Proprietary Algorithms 
##                                          13 
##                              Random Forests 
##                                          14 
##                                  Regression 
##                                          12 
##                              Rule Induction 
##                                           6 
##                     Social Network Analysis 
##                                          39 
##               Support Vector Machines (SVM) 
##                                          23 
##                           Survival Analysis 
##                                          16 
##                                 Text Mining 
##                                          39 
##                        Time Series Analysis 
##                                          75 
##                             Uplift Modeling 
##                                           6

Note: If you do not know any of the above terms, check the appendix.

What machine learning tool do they plan to use in 2018?

##                                                    
##                                                 44 
##                            Amazon Machine Learning 
##                                                 56 
##                                Amazon Web services 
##                                                 39 
##                                             Angoss 
##                                                  0 
##                                              C/C++ 
##                                                 13 
##                                           Cloudera 
##                                                  2 
##                                          DataRobot 
##                                                 26 
##                                              Flume 
##                                                  1 
##                               Google Cloud Compute 
##                                                 33 
##                                    Hadoop/Hive/Pig 
##                                                 55 
##     I don't plan on learning a new tool/technology 
##                                                 44 
##                                         IBM Cognos 
##                                                  1 
##                                   IBM SPSS Modeler 
##                                                  4 
##                                IBM SPSS Statistics 
##                                                  2 
##                       IBM Watson / Waton Analytics 
##                                                 14 
##                                             Impala 
##                                                  5 
##                                               Java 
##                                                  9 
##                                              Julia 
##                                                 32 
##                                  Jupyter notebooks 
##                                                 37 
##                         KNIME (commercial version) 
##                                                  0 
##                               KNIME (free version) 
##                                                  1 
##                                        Mathematica 
##                                                  2 
##                                      MATLAB/Octave 
##                                                  1 
##                   Microsoft Azure Machine Learning 
##                                                 18 
##                        Microsoft Excel Data Mining 
##                                                  2 
## Microsoft R Server (Formerly Revolution Analytics) 
##                                                  8 
##                   Microsoft SQL Server Data Mining 
##                                                  3 
##                                            Minitab 
##                                                  0 
##                                              NoSQL 
##                                                 10 
##            Oracle Data Mining/ Oracle R Enterprise 
##                                                  1 
##                                             Orange 
##                                                  0 
##                                              Other 
##                                                 76 
##                                               Perl 
##                                                  1 
##                                             Python 
##                                                152 
##                                           QlikView 
##                                                  0 
##                                                  R 
##                                                 78 
##                    RapidMiner (commercial version) 
##                                                  2 
##                          RapidMiner (free version) 
##                                                  4 
##           Salfrod Systems CART/MARS/TreeNet/RF/SPM 
##                                                  0 
##           SAP BusinessObjects Predictive Analytics 
##                                                  1 
##                                           SAS Base 
##                                                  5 
##                               SAS Enterprise Miner 
##                                                  4 
##                                            SAS JMP 
##                                                  2 
##                                      Spark / MLlib 
##                                                109 
##                                                SQL 
##                                                 16 
##                                               Stan 
##                                                 14 
##          Statistica (Quest/Dell-formerly Statsoft) 
##                                                  0 
##                                            Tableau 
##                                                 24 
##                                         TensorFlow 
##                                                286 
##                                     TIBCO Spotfire 
##                                                  0 
##                                   Unix shell / awk 
##                                                  3 
##                                               Weka 
##                                                  0

Further Univariate Analysis

There are many features of interest in this smaller dataset and even more in the broader dataset. This report focuses on gender, age, formal education, recommended programming language, compensation, job satisfaction, and tools and methods to learn. The general statistics or counts of each of those variables were plotted in the univariate section. In the process of plotting that data, temporary new variables were created from existing variables in the dataset.

Four interesting insights from the univariate plots are:

  1. Computer Science was not the dominant major among data scientists and machine learners.

  2. Taking the square root of compensation produced an approximately normal distribution.

  3. Most data scientists and machine learners say their salaries increased by at least 20% in the last three years.

  4. There is a mysterious decrease at 6 to 10 years of coding tenure. Is that decrease still present when coding tenure is divided by gender?
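The second insight can be checked numerically. The following is a minimal sketch on synthetic data, not the survey data: the `compensation` vector and the `skewness` helper are assumptions made for illustration. It compares the skewness of a right-skewed variable before and after a square-root transformation.

```r
set.seed(1)

# Synthetic, right-skewed "compensation" values (illustrative only):
# the square of a normal variable, so its square root is normal again.
compensation <- rnorm(1000, mean = 330, sd = 60)^2

# Hypothetical helper: sample skewness (0 for a symmetric distribution).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw  <- skewness(compensation)
skew_sqrt <- skewness(sqrt(compensation))

cat(sprintf("skewness raw = %.3f, skewness sqrt = %.3f\n", skew_raw, skew_sqrt))

# Shapiro-Wilk test of normality on the transformed values
# (a large p-value means normality is not rejected).
normality <- shapiro.test(sqrt(compensation))
print(normality)
```

A skewness closer to zero after the transformation is only one indication of approximate normality; a histogram or Q-Q plot of the transformed values is a useful companion check.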

The next segment investigates whether the statistics or distributions differ when variables are compared with or subdivided by other variables, for example whether median age or compensation differs between males and females.

Bivariate Plots

Based on the univariate plots, there are many interesting possible relationships between variables. First, is there a relationship between compensation and age for data scientists?

## 
##  Pearson's product-moment correlation
## 
## data:  mc_responses_usa$Age and mc_responses_usa$CompensationAmount
## t = 7.3631, df = 1201, p-value = 0.0000000000003327
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1531058 0.2612764
## sample estimates:
##       cor 
## 0.2078264

Note: geom_jitter was used because age is a continuous variable but respondents reported only integer ages.

Since the correlation (0.2078) is below 0.3, the Pearson method indicates only a weak linear relationship between age and compensation. However, this weak relationship is statistically significant, since its p-value is well below 0.05.
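The reading of the Pearson output above can be sketched in code. This example runs on synthetic data rather than `mc_responses_usa`; the variables `age` and `compensation` and the 0.3 threshold label are assumptions that mirror the rule of thumb used in this report.

```r
set.seed(42)
n <- 500

# Synthetic stand-ins for the survey columns (illustrative values only).
age <- sample(22:65, n, replace = TRUE)
compensation <- 60000 + 1200 * age + rnorm(n, mean = 0, sd = 60000)

# Pearson's product-moment correlation with a two-sided test of H0: correlation = 0.
pearson <- cor.test(age, compensation, method = "pearson")

r <- unname(pearson$estimate)  # sample correlation coefficient
p <- pearson$p.value           # p-value of the test

# Rule of thumb used in this report: |r| below 0.3 is treated as weak.
strength    <- if (abs(r) < 0.3) "weak" else "moderate or stronger"
significant <- p <= 0.05

cat(sprintf("r = %.3f, p = %.3g, strength = %s, significant = %s\n",
            r, p, strength, significant))
```

The two checks are independent: with a large sample, even a weak correlation can be statistically significant, which is the situation described above.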

##   Group.1        Age CompensationAmount
## 1       1  0.4471476         12.2090292
## 2       2  1.2254464          0.2941814
## 3       3 -0.5685766         -0.1955254

The weak positive correlation between age and compensation is likely because compensation typically increases over time and age always increases over time, so the two variables tend to move in the same direction. However, being a certain age does not mean you will receive a certain income.

## mc_responses_usa$Gender: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   90000   90000   90000   90000   90000   90000 
## -------------------------------------------------------- 
## mc_responses_usa$Gender: A different identity
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   20000  115000  135000  115600  150000  158000       1 
## -------------------------------------------------------- 
## mc_responses_usa$Gender: Female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     185   65000   90000  100065  130000  330000       8 
## -------------------------------------------------------- 
## mc_responses_usa$Gender: Male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   80000  110000  125427  150000 2500000      18 
## -------------------------------------------------------- 
## mc_responses_usa$Gender: Non-binary, genderqueer, or gender non-conforming
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5000   39000   97300  110825  185000  235000

The gender-compensation plot shows that males have a higher median and mean salary than females. Males also have a higher maximum, third quartile, and first quartile. Since only six and eight respondents identified as “a different identity” or “non-binary, genderqueer, or gender non-conforming,” respectively, no statistically meaningful conclusions can be drawn about their compensation.
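Per-group summaries like the ones above can be produced with base R's `by()` and `summary()`. The sketch below uses a small made-up data frame (`toy`) rather than the survey data, and also prints group sizes, since small groups are the reason some comparisons cannot be interpreted.

```r
# Hypothetical toy data; labels and amounts are made up for illustration.
toy <- data.frame(
  Gender = c("Female", "Female", "Female", "Male", "Male", "Male", "Male"),
  CompensationAmount = c(65000, 90000, NA, 80000, 110000, 150000, 125000),
  stringsAsFactors = FALSE
)

# Min, quartiles, median, mean, and max per group; summary() also counts NA's.
by_gender <- by(toy$CompensationAmount, toy$Gender, summary)
print(by_gender)

# Group sizes show whether a group is large enough to interpret.
group_n <- table(toy$Gender)
print(group_n)

# Median per group, ignoring missing values.
group_median <- tapply(toy$CompensationAmount, toy$Gender, median, na.rm = TRUE)
print(group_median)
```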

## mc_responses_usa$RecommendedLanguage: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1   55000   83000   95931  135000  200000       1 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: C/C++/C#
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    5000   66000  104500  120125  160250  350000       1 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: F#
## NULL
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Haskell
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   40000   62500   85000   85000  107500  130000 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Java
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   78000   85000   86000  117000  132500  220000 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Julia
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  210000  407500  605000  605000  802500 1000000 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Matlab
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11500   94000  106000  117158  140000  240000 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Other
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   70000   82125   94000  117688  156250  200000 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Python
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   80000  110000  125335  150000 2500000      20 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: R
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   70000   97500  110730  135000  550000       4 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: SAS
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   65000   70000  100000  120385  140000  300000       1 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Scala
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  108000  122500  140000  140429  157500  175000 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: SQL
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   71500  108000  104527  130000  220000 
## -------------------------------------------------------- 
## mc_responses_usa$RecommendedLanguage: Stata
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22000   73000   95000   81500  103500  114000

The above plot shows that Python is the most recommended programming language, followed by R and then SQL. Python recommenders also have a higher salary than R recommenders.

It would be insightful to know whether there is a significant difference between compensation by language used and compensation by language recommended. The Kaggle survey did not track that. It is also possible that some respondents would recommend Python and R equally, which Kaggle did not track. Lastly, because the survey asked what respondents “recommend,” a data scientist may have recommended a language they do not use.

Since “blank” or “no language” had 30 respondents, 30 was used as the minimum number of responses required for a group to provide useful data. This left only Python (783 respondents), R (283), and SQL (64) with enough respondents to support conclusions.

Among these three languages, the median income per recommended language is highest for Python (110,000), then SQL (108,000), then R (97,500). By mean income, the order is Python (125,335), then R (110,730), then SQL (104,527).
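The cut-off logic described above (drop groups with too few responses, then compare medians and means) can be sketched as follows. The data frame `toy` is made up, and `min_n <- 3` is a stand-in for the report's cut-off of 30 so that the example stays small.

```r
# Hypothetical toy data; amounts are illustrative, not survey values.
toy <- data.frame(
  RecommendedLanguage = c(rep("Python", 5), rep("R", 4), rep("SQL", 3), "Scala", "Julia"),
  CompensationAmount = c(80000, 110000, 150000, 95000, 125000,
                         70000, 97500, 135000, 100000,
                         71500, 108000, 130000,
                         140000, 605000),
  stringsAsFactors = FALSE
)
min_n <- 3  # assumed cut-off for this toy example (the report used 30)

# Count responses per language and keep only sufficiently large groups.
counts <- table(toy$RecommendedLanguage)
keep   <- names(counts)[counts >= min_n]
kept   <- toy[toy$RecommendedLanguage %in% keep, ]

# Median and mean compensation for the remaining groups.
med <- tapply(kept$CompensationAmount, kept$RecommendedLanguage, median)
avg <- tapply(kept$CompensationAmount, kept$RecommendedLanguage, mean)

result <- data.frame(
  language = names(med),
  n        = as.vector(counts[names(med)]),
  median   = as.vector(med),
  mean     = as.vector(avg)
)

# Rank by median, highest first.
result <- result[order(-result$median), ]
print(result)
```

Ranking by median and by mean can give different orders, as in the report, because the mean is more sensitive to a few very high values.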

What’s the breakdown of education by job satisfaction?

Is this breakdown similar for parents’ education?

## mc_responses_usa$ParentsEducation: High School
##             High School Some college/university              Bachelor's 
##                       1                       9                      37 
##                Master's               Doctorate     Professional Degree 
##                      66                      41                       2 
##                   Other 
##                       0 
## -------------------------------------------------------- 
## mc_responses_usa$ParentsEducation: Some college/university
##             High School Some college/university              Bachelor's 
##                       0                       9                      30 
##                Master's               Doctorate     Professional Degree 
##                      58                      26                       4 
##                   Other 
##                       0 
## -------------------------------------------------------- 
## mc_responses_usa$ParentsEducation: Bachelor's
##             High School Some college/university              Bachelor's 
##                       1                       5                      75 
##                Master's               Doctorate     Professional Degree 
##                     155                      84                       2 
##                   Other 
##                       2 
## -------------------------------------------------------- 
## mc_responses_usa$ParentsEducation: Master's
##             High School Some college/university              Bachelor's 
##                       1                       3                      82 
##                Master's               Doctorate     Professional Degree 
##                     152                      80                       3 
##                   Other 
##                       0 
## -------------------------------------------------------- 
## mc_responses_usa$ParentsEducation: Doctorate
##             High School Some college/university              Bachelor's 
##                       0                       1                      41 
##                Master's               Doctorate     Professional Degree 
##                      65                      55                       1 
##                   Other 
##                       0 
## -------------------------------------------------------- 
## mc_responses_usa$ParentsEducation: Professional Degree
##             High School Some college/university              Bachelor's 
##                       0                       2                      20 
##                Master's               Doctorate     Professional Degree 
##                      30                      25                       5 
##                   Other 
##                       0 
## -------------------------------------------------------- 
## mc_responses_usa$ParentsEducation: Other
##             High School Some college/university              Bachelor's 
##                       1                       3                       9 
##                Master's               Doctorate     Professional Degree 
##                      10                       7                       2 
##                   Other 
##                       1

Regarding the above correlogram, the Spearman method was used because the relationships in the data were assumed to be non-linear.

For each pair of variables, the null hypothesis is that the Spearman correlation coefficient, rho, is 0. A rho of 0 means that the ranks of one variable do not covary with the ranks of the other variable. In other words, as the ranks of one variable increase, the ranks of the other variable do not consistently increase or decrease. In the graph, a small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so the null hypothesis is rejected.
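The Spearman test described above can be sketched with base R. The variables `tenure` and `satisfaction` below are synthetic ordinal codes invented for illustration; they are not columns from the survey.

```r
set.seed(7)
n <- 300

# Synthetic ordinal variables coded as integers (illustrative only).
tenure <- sample(1:6, n, replace = TRUE)
satisfaction <- pmin(10, pmax(1, round(4 + 0.6 * tenure + rnorm(n, sd = 2))))

# Spearman rank correlation; exact = FALSE because tied ranks
# rule out an exact p-value.
spearman <- cor.test(tenure, satisfaction, method = "spearman", exact = FALSE)

rho <- unname(spearman$estimate)
p   <- spearman$p.value

# H0: rho = 0. Reject when the p-value is at or below 0.05.
reject_null <- p <= 0.05

# Spearman's rho is Pearson's r computed on the ranks.
rho_from_ranks <- cor(rank(tenure), rank(satisfaction))

cat(sprintf("rho = %.3f, p = %.3g, reject H0 = %s\n", rho, p, reject_null))
```

Because the statistic is computed on ranks, it captures monotonic relationships and only requires that the variables can be ordered, which suits survey answers coded as ordered levels.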

Excluding the forced correlations with the time spent doing different tasks at work, some logical insights from the correlation matrix are:

  1. Formal education and age are positively correlated as expected (rho = 0.272, p < .001).

  2. Coding tenure and age have the strongest correlation at 0.57, followed by tenure and compensation amount at 0.42, then compensation amount and age at 0.391, and then coding tenure and education at 0.301.

  3. Work dataset size and the machine learning tool to learn next year are correlated (rho = 0.069, p = .028), which is not unexpected, as people’s decisions about which tool to learn could be affected by how much data they have to process. Similar logic applies to recommended language and work dataset size (rho = -0.095, p = 0.002).

  4. Time spent model building is correlated with coding tenure, major, formal education, and recommended language next year, as might be expected. (It is also correlated with title fit, which was not expected.)

  5. Recommended language correlated with all the work time variables except “time other” and “time gathering data.”

  6. Job satisfaction is correlated with job title fit, salary change, and compensation amount.

  7. Salary change is correlated with coding tenure and compensation amount.

  8. Lastly, there is also a correlation between compensation and: age, gender, education (but not parents’ education), coding tenure, and algorithm understanding level.

There were more correlated variable pairs, but many begin to stretch what most people would consider justified. For instance, coding tenure and work dataset size were correlated. An argument could be made that better coders have coded longer, i.e., have a longer coding tenure, and because they are better coders, they are assigned or choose to work with larger datasets. However, others might say that work dataset size is independent of a coder’s tenure: the dataset size depends more on how many people use the service collecting the data, how many people respond to a specific survey, or which dataset the coder’s employer bought.

Further Bivariate Analysis

The box plot, before adjusting the y-axis, showed outliers near 100 years old for “a different identity,” “male,” and “non-binary, genderqueer, or gender non-conforming.” There were no outliers in this high age range for females or for individuals who did not specify a gender. This highlights two of the most difficult data collection issues: the completeness of answers and the integrity of answers. Because ages near 100 are outliers, it is doubtful that these individuals are actually 100 years old; they more likely entered an incorrect age.

The correlation matrix displays all pairwise correlations between variables. Excluding time-dependent variables (age, coding tenure, work time tasks), the strongest relationship was between salary change and compensation amount at .231, then job satisfaction and title fit at .221, and finally, the unexpected correlation between compensation amount and work dataset size at .176.

The statistically significant correlation between compensation amount and gender is consistent with the gender pay inequality that is sometimes reported in the news. Fortunately, it is only a weak correlation of 0.129.

Multivariate Plots

The densest section, around a data scientist’s education of Master’s and a job title fit of “fine,” shows that the majority of respondents had a Master’s degree, as seen in an earlier graph. The lack of a clear grouping of color at any education level suggests that there is no correlation between parents’ education and job satisfaction, or between parents’ education and the data scientist’s education. The few points at low job satisfaction values mean that most respondents were satisfied with their jobs. Lastly, the grouping near 10 and “perfectly” for Doctorate and the grouping near 1 and “poorly” for Master’s suggest that job title fit and job satisfaction could be correlated, which the correlation matrix confirms.

Further Multivariate Analysis

A surprising insight is the lack of influence of parents’ education on other variables. It was only weakly correlated with age and title fit. Parents’ education was expected to correlate with the education of their offspring, the data scientist, as well.

A linear regression model and a K-means cluster model were also created; they appear in the “Final Plots and Summary” section.

Final Plots and Summary

The insights drawn from the numerous plots and statistics of this public dataset can aid employees, employers, and others by helping them determine what to learn, what to include in job requirements, how to title job positions, and at what salary level to begin compensation packages. To continue emphasizing key information about the data science and machine learning industry, two multivariate plots and a cluster analysis are highlighted in the summary.

Lastly, some tips drawn from the data insights are:

  1. Learn Python, R, and SQL, as they are the languages most used by data scientists.

  2. Learn deep learning and neural nets as they will be the most sought-after techniques in the future.

  3. Develop skills for gathering data as it can be the most time-consuming process in the workflow of a data scientist.

  4. Statistics and mathematics are vital to understanding how certain algorithms work.

Immutable Attributes

Age, gender, and compensation cannot easily be changed by an employee. Compensation is not considered easy to change because it is determined by the employer.

The above graph shows the dominance of male (blue) over female (light red) responses in the survey. It also makes some interesting outliers evident, such as the woman who is around 70 years old but among the lowest paid, or the 25-year-old man who is among the highest paid.

K-Means Clustering

To analyze the point variability of two of these immutable attributes, a K-Means cluster analysis was conducted.

A k of 3 was chosen because solutions with 2 and 4 clusters explained a smaller percentage of point variability.
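Comparing candidate values of k can be sketched with base R's `kmeans()`. The example below runs on synthetic age and compensation values (not the survey data) and reports the share of total variability explained by the clustering, `betweenss / totss`, for k = 2, 3, and 4. Scaling the variables first is an assumption made here so that compensation, with its much larger units, does not dominate the distance calculation.

```r
set.seed(123)

# Synthetic stand-in: three loose groups of (age, compensation) points.
toy <- data.frame(
  Age = c(rnorm(100, 28, 3), rnorm(100, 45, 5), rnorm(20, 40, 8)),
  CompensationAmount = c(rnorm(100, 80000, 15000),
                         rnorm(100, 130000, 20000),
                         rnorm(20, 400000, 50000))
)

# z-scores so both variables contribute comparably to the distances.
scaled <- scale(toy)

# Share of total sum of squares explained by the clustering, per k.
explained <- sapply(2:4, function(k) {
  fit <- kmeans(scaled, centers = k, nstart = 25, iter.max = 50)
  fit$betweenss / fit$totss
})
names(explained) <- paste0("k=", 2:4)
print(round(explained, 3))

# Fit the chosen model and average the scaled variables within each cluster.
fit3 <- kmeans(scaled, centers = 3, nstart = 25, iter.max = 50)
centers <- aggregate(as.data.frame(scaled), by = list(fit3$cluster), FUN = mean)
print(centers)
```

With well-converged fits this ratio usually does not fall as k grows, so a common complement is to look for the k after which the gain levels off. The `aggregate()` call returns one row of scaled means per cluster, with the cluster label in a `Group.1` column.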

Mutable Attributes

The level at which an employee understands algorithms and the programming tools or languages that an employee recommends can easily be changed by that employee.

## 
## Call:
## lm(formula = MLMnum ~ AlULnum)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.278  -4.278  -4.097   3.813  14.903 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 11.05200    0.55126  20.049 <0.0000000000000002 ***
## AlULnum      0.04514    0.14861   0.304               0.761    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.902 on 1231 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  7.494e-05,  Adjusted R-squared:  -0.0007373 
## F-statistic: 0.09226 on 1 and 1231 DF,  p-value: 0.7614

The chart above shows that within Python, the most popularly recommended language among data scientists and machine learners, those who also selected “Deep Learning” include the most people who know enough to code algorithms from scratch and have them be fast and efficient.
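The regression summary shown earlier in this section reports a slope of about 0.045 with a p-value of 0.761 and an R-squared near zero, so it gives no evidence of a linear relationship between the two coded variables. A minimal sketch of that kind of model follows, on synthetic data. It assumes that `AlULnum` and `MLMnum` are integer codes of categorical answers; the factor levels used here are placeholders, not the survey's actual levels.

```r
set.seed(2018)
n <- 200

# Hypothetical categorical answers; level names are placeholders.
understanding <- factor(
  sample(c("basic", "intermediate", "advanced"), n, replace = TRUE),
  levels = c("basic", "intermediate", "advanced")
)
method_next <- factor(
  sample(c("Deep learning", "Neural Nets", "Time Series Analysis", "Other"),
         n, replace = TRUE)
)

# Integer codes for the factor levels (assumed analogue of AlULnum and MLMnum).
AlULnum <- as.integer(understanding)
MLMnum  <- as.integer(method_next)

# Simple linear regression of one coded variable on the other.
fit   <- lm(MLMnum ~ AlULnum)
coefs <- summary(fit)$coefficients

slope     <- coefs["AlULnum", "Estimate"]
slope_p   <- coefs["AlULnum", "Pr(>|t|)"]
r_squared <- summary(fit)$r.squared

cat(sprintf("slope = %.4f, p = %.3f, R-squared = %.5f\n",
            slope, slope_p, r_squared))
```

Integer codes impose an ordering on the levels, which is arbitrary for a nominal variable such as the method to learn next, so the slope should be read with care. A chi-squared test of independence on the cross-tabulation, for example `chisq.test(table(understanding, method_next))`, is one alternative for two categorical variables.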

Reflection

The US data science and machine learning dataset contains information on 1,240 respondents across 22 variables from 2017. I started by understanding the individual variables in the dataset in the univariate section, and then I explored interesting leads as I continued to make observations on plots, apply statistical concepts, and describe relationships in the data. Eventually, I explored compensation across many variables and created a K-means cluster graph.

There were never strong trends between variables, but there were many statistically significant correlations. I was also surprised that algorithm understanding level correlated with the most variables, 12 out of 21 possible. The next closest was job satisfaction, which correlated with nine other variables.

In future work, the data science community could benefit from learning about job search duration. Job hunt time was included in the Kaggle survey, but the question was not answered by respondents in the United States. Additionally, a breakdown of data scientists by race/ethnicity and sexual orientation could also provide interesting insight.

The respondents to Kaggle’s survey were most likely Kaggle users, which means the resultant dataset is biased toward that type of respondent. It is unknown whether data scientists who use Kaggle are significantly different from data scientists who do not.

Appendix

Terminology

Association Rules - a rule-based machine learning method for discovering interesting relations between variables in large databases.[3]

Bayesian Methods - a statistical inference method in which Bayes’ theorem is used to update the probability of a hypothesis as more data becomes available.

Convolutional Neural Network - a class of feed-forward neural networks, composed of one or more convolutional layers followed by one or more fully connected layers, that has been used successfully for analyzing imagery.

Cluster Analysis - grouping a set of items in such a way that objects in the same cluster are more similar (in some sense) to each other than to those in other clusters.

Collaborative Filtering - a technique used by recommender systems that predicts a user’s preferences from the preferences of many other users.

Cross-Validation - a model validation method for assessing how the outcomes of a statistical analysis will generalize to an independent dataset.

Decision Trees - a decision support tool that uses a tree-like graph or model of decisions and their possible outcomes, including resource costs, utility, and chance event outcomes.

Deep Learning - a subset of machine learning methods based on learning data representations, as opposed to task-specific algorithms.

Dimensional Modeling - a set of concepts and techniques used in data warehouse design.

Ensemble Methods (e.g., boosting, bagging) - the use of multiple learning algorithms to achieve better predictive performance than could be obtained from any of the constituent learning algorithms alone.

Factor Analysis - a statistical technique to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. For instance, it is possible that variations in eight observed variables mainly reflect the changes in two underlying, unobserved variables.

Genetic and Evolutionary Algorithms - an evolutionary algorithm (EA) is a generic population-based metaheuristic optimization algorithm that uses mechanisms inspired by biological evolution, such as mutation, reproduction, selection, and recombination. Genetic algorithms are a frequently encountered class of evolutionary algorithm, but there are other types.

Link Analysis - a data-analysis technique used to evaluate relationships between nodes. Relationships may be identified among various types of nodes, including organizations, people, and transactions.

Multivariate Adaptive Regression Splines - a non-parametric regression technique that automatically models nonlinearities and interactions between variables.

Monte Carlo Methods - a set of computational algorithms that depend on repeated random sampling to obtain numerical results.

Neural Nets - a system of data structures and programs that approximates the operation of the human brain. A neural network ordinarily involves many processors operating in parallel.

Principal Components Analysis - a statistical operation that uses an orthogonal transformation to convert a batch of potentially correlated variables into a group of linearly uncorrelated variables termed principal components.

Random Forests - an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

Linear Regression - a linear approach for modeling the relationship between one or more explanatory variables and a scalar dependent variable.

Rule Induction - an area of machine learning in which formal rules are extracted from a set of observations. The rules obtained may represent a full scientific model of the data, or merely local patterns in the data.

Social Network Analysis - a process of examining social structures through the use of networks and graph theory. It characterizes networked structures as nodes (individual actors, things, or people within the system) and the ties, edges, or links (interactions or relationships) that connect them.

Support Vector Machines (SVM) - a discriminative classifier formally defined by a separating hyperplane; i.e., given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.

Survival Analysis - a branch of statistics for analyzing the expected time until one or more events happen.

Text Mining - the process of deriving high-quality information from text, typically by identifying trends or patterns through methods like statistical pattern learning.

Time Series - a sequence of data points indexed in time order. Since a time series is commonly a sequence taken at successive equally spaced intervals, it is a sequence of discrete-time data.

Uplift Modeling - a predictive modeling method that directly illustrates the incremental impact of a treatment, such as a marketing action, on an actor’s behavior.

Works Cited

  1. Kaggle. “Kaggle ML and Data Science Survey, 2017.” Kaggle, 2017. Web. 12 January 2018. https://www.kaggle.com/kaggle/kaggle-survey-2017/.

  2. The percentages, which are calculated from Census data by counting people who had attained that level or higher, add up to more than 100% because they are cumulative. For example, it is assumed that all people with doctorates also have undergraduate and high school degrees, and are thus counted again in the “lower” categories. “Educational Attainment in the United States: 2014”. U.S. Census Bureau. Retrieved January 29, 2015.

  3. Piatetsky-Shapiro, Gregory (1991), Discovery, analysis, and presentation of strong rules, in Piatetsky-Shapiro, Gregory; and Frawley, William J.; eds., Knowledge Discovery in Databases, AAAI/MIT Press, Cambridge, MA.
